{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
# cs1001.py , Tel Aviv University, Fall 2020
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recitation 12\n", "\n", "We discussed two text-compression algorithms: Huffman coding and Lempel-Ziv." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Takeaways:\n", "- Huffman Coding is a lossless compression technique. Given a corpus, a character frequency analysis is performed and a prefix-free tree is built according to the analysis\n", "- Given a text, the characters are encoded using the tree\n", "- Given an encoding, decoding is performed in a \"greedy\" manner (for this reason a prefix-free encoding is crucial)\n", "- While Huffman coding is optimal, if the character frequency of the corpus differs from that of the encoded text, the compression can be inefficient" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Code for printing several outputs in one cell (not part of the recitation):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from IPython.core.interactiveshell import InteractiveShell\n", "InteractiveShell.ast_node_interactivity = \"all\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Huffman Coding\n", "\n", "Generates a variable-length, prefix-free code, based on the character frequencies in a corpus string.\n", "Notice that there are several degrees of freedom that could lead to different tree structures." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def ascii2bit_stream(text):\n", "    \"\"\" Translates ASCII text to binary representation using\n", "    7 bits per character. 
Assume only ASCII chars \"\"\"\n", "    return \"\".join([bin(ord(c))[2:].zfill(7) for c in text])\n", "\n", "\n", "########################################################\n", "#### HUFFMAN CODE\n", "########################################################\n", "\n", "def char_count(text):\n", "    \"\"\" Counts the number of each character in text.\n", "    Returns a dictionary, with keys being the observed characters,\n", "    values being the counts \"\"\"\n", "    d = {}\n", "    for ch in text:\n", "        if ch in d:\n", "            d[ch] += 1\n", "        else:\n", "            d[ch] = 1\n", "    return d\n", "\n", "\n", "def build_huffman_tree(char_count_dict):\n", "    \"\"\" Receives a dictionary with char:count entries.\n", "    Generates a LIST structure representing\n", "    the binary Huffman encoding tree \"\"\"\n", "    queue = [(c, cnt) for (c, cnt) in char_count_dict.items()]\n", "\n", "    while len(queue) > 1:\n", "        # combine the two smallest elements\n", "        A, cntA = extract_min(queue)  # smallest in queue\n", "        B, cntB = extract_min(queue)  # next smallest\n", "        chars = [A, B]\n", "        weight = cntA + cntB  # combined weight\n", "        queue.append((chars, weight))  # insert combined node\n", "\n", "    # only the root node is left\n", "    root, weight_trash = extract_min(queue)  # weight_trash unused\n", "    return root  # a LIST representing the tree structure\n", "\n", "\n", "def extract_min(queue):\n", "    \"\"\" queue is a list of 2-tuples (x,y).\n", "    Remove and return the tuple with minimal y \"\"\"\n", "    min_pair = min(queue, key=lambda pair: pair[1])\n", "    queue.remove(min_pair)\n", "    return min_pair\n", "\n", "\n", "def generate_code(huff_tree, prefix=\"\"):\n", "    \"\"\" Receives a Huffman tree with embedded encoding,\n", "    and a prefix of encodings.\n", "    Returns a dictionary where characters are\n", "    keys and associated binary strings are values. \"\"\"\n", "    if isinstance(huff_tree, str):  # a leaf\n", "        return {huff_tree: prefix}\n", "    else:\n", "        lchild, rchild = 
huff_tree[0], huff_tree[1]\n", "        codebook = {}\n", "        codebook.update(generate_code(lchild, prefix+'0'))\n", "        codebook.update(generate_code(rchild, prefix+'1'))\n", "        # oh, the beauty of recursion...\n", "        return codebook\n", "\n", "\n", "def compress(text, encoding_dict):\n", "    \"\"\" compress text using encoding dictionary \"\"\"\n", "    assert isinstance(text, str)\n", "    return \"\".join(encoding_dict[ch] for ch in text)\n", "\n", "\n", "def reverse_dict(d):\n", "    \"\"\" build the \"reverse\" of an encoding dictionary \"\"\"\n", "    return {y: x for (x, y) in d.items()}\n", "\n", "\n", "def decompress(bits, decoding_dict):\n", "    \"\"\" decompress bits greedily using a decoding dictionary \"\"\"\n", "    prefix = \"\"\n", "    result = []\n", "    for bit in bits:\n", "        prefix += bit\n", "        if prefix in decoding_dict:\n", "            result.append(decoding_dict[prefix])\n", "            prefix = \"\"  # restart\n", "    assert prefix == \"\"  # must finish last codeword\n", "    return \"\".join(result)  # converts list of chars to a string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The Huffman Encoding Process" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "s = \"live and let live\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Frequency dictionary" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'l': 3, 'i': 2, 'v': 2, 'e': 3, ' ': 3, 'a': 1, 'n': 1, 'd': 1, 't': 1}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "count_dict = char_count(s)\n", "count_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a Huffman tree:\n", "\n", "The function returns a list which represents a binary tree as follows:\n", "* A string in the list is a leaf\n", "* Otherwise, a list of size two represents two sub-trees: the left subtree and the right subtree\n", "\n", "The sequence of indices that leads to a leaf (char) corresponds to the path leading to the leaf, and therefore to the encoding 
of the character in the final Huffman code" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[[' ', ['i', 'v']], [[['a', 'n'], ['d', 't']], ['l', 'e']]]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[' ', ['i', 'v']]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "[[['a', 'n'], ['d', 't']], ['l', 'e']]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "' '" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'v'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "huff_tree = build_huffman_tree(count_dict)\n", "\n", "# A list with two elements\n", "huff_tree\n", "\n", "# The left subtree of the root\n", "huff_tree[0]\n", "\n", "# The right subtree of the root\n", "huff_tree[1]\n", "\n", "# Leaves correspond to characters in the final code\n", "huff_tree[0][0]\n", "huff_tree[0][1][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate binary representation for each char in the corpus based on the tree" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{' ': '00',\n", " 'i': '010',\n", " 'v': '011',\n", " 'a': '1000',\n", " 'n': '1001',\n", " 'd': '1010',\n", " 't': '1011',\n", " 'l': '110',\n", " 'e': '111'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = generate_code(huff_tree)\n", "d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compressing some text using our code" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "text1 = \"tell\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"'1011111110110'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "13" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compress(text1, d)\n", "len(compress(text1, d))#4+3+3+3" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "28" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unicode_conversion = ascii2bit_stream(\"tell\") #when we convert every character to 7 bits\n", "len(unicode_conversion)#4*7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decoding" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'00': ' ',\n", " '010': 'i',\n", " '011': 'v',\n", " '1000': 'a',\n", " '1001': 'n',\n", " '1010': 'd',\n", " '1011': 't',\n", " '110': 'l',\n", " '111': 'e'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "e = reverse_dict(d)\n", "e" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'tell'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decompress('1011111110110', e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Handling missing characters" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'\\x00\\x01\\x02\\x03\\x04\\x05\\x06\\x07\\x08\\t\\n\\x0b\\x0c\\r\\x0e\\x0f\\x10\\x11\\x12\\x13\\x14\\x15\\x16\\x17\\x18\\x19\\x1a\\x1b\\x1c\\x1d\\x1e\\x1f !\"#$%&\\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\\x7f'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asci = str.join(\"\",[chr(c) for c in range(128)])\n", "asci" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"### Computing the compression ratio" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def compression_ratio(text, corpus):\n", " d = generate_code(build_huffman_tree(char_count(corpus)))\n", " len_compress = len(compress(text, d))\n", " len_ascii = len(ascii2bit_stream(text)) #len(text)*7\n", " return len_compress/len_ascii" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.14285714285714285" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.14285714285714285" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"ab\", \"ab\")\n", "1/7" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.2857142857142857" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.2857142857142857" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"ab\", \"abcd\")\n", "2/7" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "compression_ratio(\"hello\", \"a\"*100 + asci)\n", "8/7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Alternative Huffman Trees (2014aa)\n", "\n", "Two Huffman trees will be called *alternative trees* if they are generated by the same corpus but the sequence of coding lengths they generate is different.\n", "\n", "For example, given the corpus ```abcdd``` the following trees are alternative:\n", "\n", "\n", "\n", "Since the sequence of lengths they generate 
is $2, 2, 2, 2$ and $3, 3, 2, 1$ respectively. The following trees, on the other hand, are *not* alternative:\n", "\n", "\n", "\n", "since they all generate trees with coding lengths $3, 3, 2, 1$.\n", "\n", "Note: this definition does not address our Python implementation of the algorithm but instead refers to an algorithm where at each point two nodes of minimal weight are extracted arbitrarily.\n", "\n", "* Show two alternative trees for the corpus ```abcdee```.\n", "\n", "Answer:\n", "\n", "\n", "\n", "* Prove or disprove: there exists a corpus of five distinct letters that can generate the following alternative trees. If true, give such a corpus; if false, explain why this is impossible:\n", "\n", "\n", "\n", "Answer: the claim is true, and one such corpus is ```abcddeee```.\n", "\n", "* Prove or disprove: if a corpus has *unique* frequencies then it cannot have two alternative trees.\n", "\n", "Answer: the claim is false. Consider the corpus ```aabbbccccddddd``` (with unique frequencies $2, 3, 4, 5$) and note that it can generate the following alternative trees:\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lempel-Ziv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The LZ compression algorithm works by encoding succinct representations of repetitions in the text. 
A repetition of the form ```[offset,length]``` means that the next ```length``` characters appeared ```offset``` characters earlier.\n", "\n", "- It might be that offset < length\n", "- Usually, offset < 4096 and length < 32" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def str_to_ascii(text):\n", "    \"\"\" Gets rid of non-ascii characters in text \"\"\"\n", "    return ''.join(ch for ch in text if ord(ch) < 128)\n", "\n", "\n", "def maxmatch(T, p, w=2**12-1, max_length=2**5-1):\n", "    \"\"\" finds a maximum match of length k <= 2**5-1 in a\n", "    w long window, T[p:p+k] with T[p-m:p-m+k].\n", "    Returns m (offset) and k (match length) \"\"\"\n", "    assert isinstance(T, str)\n", "    n = len(T)\n", "    maxmatch = 0\n", "    offset = 0\n", "    # Why do we need the min here?\n", "    for m in range(1, min(p+1, w)):\n", "        k = 0\n", "        # Why do we need the min here?\n", "        while k < min(n-p, max_length) and T[p-m+k] == T[p+k]:\n", "            k += 1\n", "        # at this point, T[p-m:p-m+k] == T[p:p+k]\n", "        if maxmatch < k:\n", "            maxmatch = k\n", "            offset = m\n", "    return offset, maxmatch\n", "# returned offset is the smallest one (closest to p) among\n", "# all max matches (m starts at 1)\n", "\n", "\n", "def LZW_compress(text, w=2**12-1, max_length=2**5-1):\n", "    \"\"\"LZW compression of an ascii text. 
Produces\n", "    a list comprising of either ascii characters\n", "    or pairs [m,k] where m is an offset and\n", "    k is a match length (both are non-negative integers) \"\"\"\n", "    result = []\n", "    n = len(text)\n", "    p = 0\n", "    while p < n:\n", "        m, k = maxmatch(text, p, w, max_length)\n", "        if k < 2:\n", "            result.append(text[p])  # a single char\n", "            p += 1\n", "        else:\n", "            result.append([m, k])  # two or more chars in match\n", "            p += k\n", "    return result  # a list composed of chars and pairs\n", "\n", "\n", "def LZW_compress2(text, w=2**12-1, max_length=2**5-1):\n", "    \"\"\"LZW compression of an ascii text. Produces\n", "    a list comprising of either ascii characters\n", "    or pairs [m,k] where m is an offset and\n", "    k > 2 is a match length (both are non-negative integers) \"\"\"\n", "    result = []\n", "    n = len(text)\n", "    p = 0\n", "    while p < n:\n", "        m, k = maxmatch(text, p, w, max_length)\n", "        if k < 3:  # modified from k < 2: a pair costs 18 bits,\n", "            result.append(text[p])  # so fewer than 3 chars are\n", "            p += 1  # cheaper as 8-bit literals\n", "        else:\n", "            result.append([m, k])\n", "            p += k\n", "    return result\n", "\n", "\n", "def inter_to_bin(lst, w=2**12-1, max_length=2**5-1):\n", "    \"\"\" converts intermediate format compressed list\n", "    to a string of bits: '0' + 7 bits for a char,\n", "    '1' + offset bits + length bits for a pair \"\"\"\n", "    offset_width = math.ceil(math.log(w, 2))\n", "    match_width = math.ceil(math.log(max_length, 2))\n", "    result = []\n", "    for elem in lst:\n", "        if type(elem) == str:  # a single char\n", "            result.append(\"0\")\n", "            result.append((bin(ord(elem))[2:]).zfill(7))\n", "        elif type(elem) == list:  # an [m,k] pair\n", "            result.append(\"1\")\n", "            m, k = elem\n", "            result.append((bin(m)[2:]).zfill(offset_width))\n", "            result.append((bin(k)[2:]).zfill(match_width))\n", "    return \"\".join(result)\n", "\n", "\n", "def bin_to_inter(compressed, w=2**12-1, max_length=2**5-1):\n", "    \"\"\" converts a compressed string of bits\n", "    to intermediate compressed format \"\"\"\n", "    offset_width = math.ceil(math.log(w, 2))\n", "    match_width = math.ceil(math.log(max_length, 2))\n", "    result = []\n", "    n = len(compressed)\n", "    p = 0\n", "    while p < n:\n", "        if compressed[p] == \"0\":  # single ascii char\n", "            p += 1\n", "            char = chr(int(compressed[p:p+7], 2))\n", "            result.append(char)\n", "            p += 7\n", "        elif compressed[p] == \"1\":  # repeat of length > 2\n", "            p += 1\n", "            m = int(compressed[p:p+offset_width], 2)\n", "            p += offset_width\n", "            k = int(compressed[p:p+match_width], 2)\n", "            p += match_width\n", "            result.append([m, k])\n", "    return result\n", "\n", "\n", "def LZW_decompress(compressed, w=2**12-1, max_length=2**5-1):\n", "    \"\"\" decompression of a compressed (intermediate format) list \"\"\"\n", "    result = []\n", "    n = len(compressed)\n", "    p = 0\n", "    while p < n:\n", "        if type(compressed[p]) == str:  # char, as opposed to a pair\n", "            result.append(compressed[p])\n", "        else:\n", "            m, k = compressed[p]\n", "            for i in range(k):\n", "                # append k times to result, at a fixed offset\n", "                # m \"to the left\", as result grows\n", "                result.append(result[-m])\n", "        p += 1\n", "    return \"\".join(result)\n", "\n", "\n", "#########################\n", "### Various executions\n", "#########################\n", "\n", "def process1(text):\n", "    \"\"\" packages the whole process using LZW_compress \"\"\"\n", "    atext = str_to_ascii(text)\n", "    inter = LZW_compress(atext)\n", "    binn = inter_to_bin(inter)\n", "    inter2 = bin_to_inter(binn)\n", "    text2 = LZW_decompress(inter2)\n", "    return inter, binn, inter2, text2\n", "\n", "\n", "def process2(text):\n", "    \"\"\" packages the whole process using LZW_compress2 \"\"\"\n", "    atext = str_to_ascii(text)\n", "    inter = LZW_compress2(atext)\n", "    binn = inter_to_bin(inter)\n", "    inter2 = bin_to_inter(binn)\n", "    text2 = LZW_decompress(inter2)\n", "    return inter, binn, inter2, text2\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The LZ Encoding Process" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "inter1, bits, inter2, text2 = process2(\"abcdabc\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the encoded representation?"
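, "

Each list element is flagged with one leading bit: a literal character costs 1 + 7 = 8 bits, while an `[m,k]` pair costs 1 + 12 + 5 = 18 bits (12 bits for the offset, 5 for the length, under the default `w` and `max_length`). As a sanity check, here is a minimal hand-rolled sketch (not part of the recitation code) that rebuilds the expected bit string for `['a', 'b', 'c', 'd', [4, 3]]`:

```python
# Rebuild the bit stream for ['a', 'b', 'c', 'd', [4, 3]] by hand:
# a literal char is flagged by '0' and packed into 7 bits;
# an [m, k] pair is flagged by '1', with m in 12 bits and k in 5 bits.
inter = ['a', 'b', 'c', 'd', [4, 3]]
bits = ''
for elem in inter:
    if isinstance(elem, str):
        bits += '0' + bin(ord(elem))[2:].zfill(7)
    else:
        m, k = elem
        bits += '1' + bin(m)[2:].zfill(12) + bin(k)[2:].zfill(5)
print(bits)       # 4 literals * 8 bits + one pair * 18 bits
print(len(bits))  # 50
```
"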
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['a', 'b', 'c', 'd', [4, 3]]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "'01100001011000100110001101100100100000000010000011'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inter1\n", "bits" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(bits)#8*4+18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How do we really encode the chars?" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['01100001', '01100010', '01100011', '01100100']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "['1100001', '1100010', '1100011', '1100100']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[bits[8*i:8*(i+1)] for i in range(4)]\n", "[bin(ord(c))[2:] for c in \"abcd\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Going back" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abcdabc'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another example" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['x', 'y', 'z', [3, 3], 'w', [4, 4]]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inter1, bits, inter2, text2 = process2(\"xyzxyzwxyzw\")\n", "inter1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compression ratio" ] }, { 
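"cell_type": "markdown", "metadata": {}, "source": [ "
We measure the size of the LZ bit stream relative to a plain 7-bit ASCII encoding. For a long run such as `'a' * 1000` the ratio can be predicted by hand; the sketch below (an illustration, not part of the recitation code) assumes the default parameters: 8 bits per literal, 18 bits per pair, and at most 31 characters covered by each pair.

```python
# Predict the LZ bit count for 'a' * n (here n = 1000):
# the first 'a' is a literal (8 bits); the remaining n - 1 chars
# are covered by [1, k] pairs of 18 bits each, up to 31 chars per pair.
n = 1000
pairs = (n - 1 + 30) // 31   # ceiling of (n - 1) / 31
bits = 8 + 18 * pairs
ratio = bits / (7 * n)       # plain ASCII costs 7 bits per char
print(ratio)                 # 0.086, matching the measured ratio below
```
" ] }, { 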
"cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "def lz_ratio(text):\n", "    inter1, bits, inter2, text2 = process2(text)\n", "    return len(bits)/(len(text)*7)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.086" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.08317142857142858" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lz_ratio('a' * 1000)\n", "lz_ratio('a' * 10000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So what happens to ```lz_ratio('a' * n)``` as $n \\to \\infty$?\n", "\n", "Each repetition in the LZ compression requires 18 bits and we need roughly $n/31$ such repetitions. On the other hand, the regular ASCII representation requires 7 bits per character; thus the ratio approaches $18/(31\\cdot 7)$." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.08294930875576037" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "18/(31*7)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.1428571428571428" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.8285714285714286" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/plain": [ "0.8285714285714286" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lz_ratio('hello')\n", "lz_ratio('hello' * 2)\n", "(8*5+18) / (7*10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Comparing Huffman and LZ compression ratios" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "import random\n", "\n", "asci = str.join(\"\",[chr(c) for c 
in range(128)])\n", "rand = \"\".join([random.choice(asci) for i in range(1000)])\n", "s = \"\".join([chr(ord(\"a\")+i-1)*i for i in range(1,27)])\n" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiiijjjjjjjjjjkkkkkkkkkkkllllllllllllmmmmmmmmmmmmmnnnnnnnnnnnnnnoooooooooooooooppppppppppppppppqqqqqqqqqqqqqqqqqrrrrrrrrrrrrrrrrrrsssssssssssssssssssttttttttttttttttttttuuuuuuuuuuuuuuuuuuuuuvvvvvvvvvvvvvvvvvvvvvvwwwwwwwwwwwwwwwwwwwwwwwxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzzzzzzzzzzzz'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Huffman compression ratio: 0.9935714285714285\n", "LZ compression ratio: 1.1428571428571428\n", "Huffman compression ratio: 0.6866096866096866\n", "LZ compression ratio: 0.2629222629222629\n", "Huffman compression ratio: 0.6538857142857143\n", "LZ compression ratio: 0.6195142857142857\n" ] } ], "source": [ "def compare_lz_huffman(text):\n", " print(\"Huffman compression ratio:\", compression_ratio(text,text+asci))\n", " print(\"LZ compression ratio:\", lz_ratio(text))\n", " \n", "compare_lz_huffman(rand)\n", "compare_lz_huffman(s)\n", "compare_lz_huffman(open(\"looking_glass.txt\", \"r\").read()[:10000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An explanation of the results can be found here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": 
"3.7.2" } }, "nbformat": 4, "nbformat_minor": 1 }