{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1><center>cs1001.py , Tel Aviv University, Spring 2019</center></h1>\n",
    "<img src=\"http://www.pngall.com/wp-content/uploads/2016/05/Python-Logo-PNG-Image-180x180.png\" width=50/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recitation 11\n",
    "\n",
    "We discussed two text-compression algorithms: Huffman coding and Lempel-Ziv."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###### Takeaways:\n",
    "- Huffman Coding is a lossless compression technique. Given a corpus, a character frequency analysis is performed and a prefix free tree is built according to the analysis\n",
    "- Given a text, the characters are encoded using the tree\n",
    "- Given an encoding, decoding is performed in a \"greedy\" manner (for this reason a prefix free encoding is crucial)\n",
    "- While Huffman coding is optimal, if the character frequency of the corpus differs from that of the encoded text, the compression can be inefficient"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Code for printing several outputs in one cell (not part of the recitation):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.interactiveshell import InteractiveShell\n",
    "InteractiveShell.ast_node_interactivity = \"all\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Huffman Coding\n",
    "\n",
    "Generates a variable-length, prefix-free code, based on the character frequencies in corpus string.\n",
    "Notice that there are several degrees of freedom that could lead to different tree structures."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"huffman.PNG\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ascii2bit_stream(text):\n",
    "    \"\"\" Translates ASCII text to binary reprersentation using\n",
    "        7 bits per character. Assume only ASCII chars \"\"\"\n",
    "    return \"\".join([bin(ord(c))[2:].zfill(7) for c in text])\n",
    "\n",
    "\n",
    "\n",
    "########################################################\n",
    "#### HUFFMAN CODE\n",
    "########################################################\n",
    "\n",
    "def char_count(text):\n",
    "    \"\"\" Counts the number of each character in text.\n",
    "        Returns a dictionary, with keys being the observed characters,\n",
    "        values being the counts \"\"\"\n",
    "    d = {}\n",
    "    for ch in text:\n",
    "        if ch in d:\n",
    "            d[ch] += 1\n",
    "        else:\n",
    "            d[ch] = 1\n",
    "    return d\n",
    "\n",
    "\n",
    "def build_huffman_tree(char_count_dict):\n",
    "    \"\"\" Recieves dictionary with char:count entries\n",
    "        Generates a LIST structure representing\n",
    "        the binary Huffman encoding tree \"\"\"\n",
    "    queue = [(c,cnt) for (c,cnt) in char_count_dict.items()]\n",
    "\n",
    "    while len(queue) > 1:\n",
    "        #print(queue)\n",
    "        # combine two smallest elements\n",
    "        A, cntA = extract_min(queue)    # smallest in queue\n",
    "        B, cntB = extract_min(queue)    # next smallest\n",
    "        chars = [A,B]\n",
    "        weight = cntA + cntB            # combined weight\n",
    "        queue.append((chars, weight))   # insert combined node\n",
    "\n",
    "    # only root node left\n",
    "    #print(\"final queue:\", queue)\n",
    "    root, weight_trash = extract_min(queue) # weight_trash unused\n",
    "    return root                             #a LIST representing the tree structure\n",
    "\n",
    "\n",
    "def extract_min(queue): \n",
    "    \"\"\" queue is a list of 2-tuples (x,y).\n",
    "        remove and return the tuple with minimal y \"\"\" \n",
    "    min_pair = min(queue, key = lambda pair: pair[1])\n",
    "    queue.remove(min_pair)\n",
    "    return min_pair\n",
    "\n",
    "\n",
    "\n",
    "def generate_code(huff_tree, prefix=\"\"):\n",
    "    \"\"\" Receives a Huffman tree with embedded encoding,\n",
    "        and a prefix of encodings.\n",
    "        returns a dictionary where characters are\n",
    "        keys and associated binary strings are values.\"\"\"\n",
    "    if isinstance(huff_tree, str): # a leaf\n",
    "        return {huff_tree: prefix}\n",
    "    else:\n",
    "        lchild, rchild = huff_tree[0], huff_tree[1]\n",
    "        codebook = {}\n",
    "\n",
    "        codebook.update(generate_code(lchild, prefix+'0'))\n",
    "        codebook.update(generate_code(rchild, prefix+'1'))\n",
    "        #   oh, the beauty of recursion...\n",
    "        return codebook\n",
    "\n",
    "    \n",
    "def compress(text, encoding_dict):\n",
    "    \"\"\" compress text using encoding dictionary \"\"\"\n",
    "    assert isinstance(text, str)\n",
    "    return \"\".join(encoding_dict[ch] for ch in text)\n",
    "\n",
    "\n",
    "def reverse_dict(d):\n",
    "    \"\"\" build the \"reverse\" of encoding dictionary \"\"\"\n",
    "    return {y:x for (x,y) in d.items()}\n",
    "\n",
    "\n",
    "def decompress(bits, decoding_dict):\n",
    "    prefix = \"\"\n",
    "    result = []\n",
    "    for bit in bits:\n",
    "        prefix += bit\n",
    "        if prefix in decoding_dict:\n",
    "            result.append(decoding_dict[prefix])\n",
    "            prefix = \"\"  #restart\n",
    "    assert prefix == \"\" # must finish last codeword\n",
    "    return \"\".join(result)  # converts list of chars to a string"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The Huffman Encoding Process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = \"live and let live\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Frequency dictionary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'l': 3, 'i': 2, 'v': 2, 'e': 3, ' ': 3, 'a': 1, 'n': 1, 'd': 1, 't': 1}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "count_dict = char_count(s)\n",
    "count_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Build a Huffman tree:\n",
    "\n",
    "The function returns a list which represents a binary tree as follows:\n",
    "* A string in the list is a leaf\n",
    "* Otherwise, a list of size two represents two sub-trees: the left subtree and the right subtree\n",
    "\n",
    "The set of indices that lead to a leaf (char) correspond to the path leading to the leaf, and therefore to the encoding of the character in the final Huffman code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[[' ', ['i', 'v']], [[['a', 'n'], ['d', 't']], ['l', 'e']]]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "[' ', ['i', 'v']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "[[['a', 'n'], ['d', 't']], ['l', 'e']]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "' '"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'v'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "huff_tree = build_huffman_tree(count_dict)\n",
    "\n",
    "# A list with two elements\n",
    "huff_tree\n",
    "\n",
    "# The left subtree of the root\n",
    "huff_tree[0]\n",
    "\n",
    "# The right subtree of the root\n",
    "huff_tree[1]\n",
    "\n",
    "# Leaves correspond to characters in the final code\n",
    "huff_tree[0][0]\n",
    "huff_tree[0][1][1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate binary representation for each char in the corpus based on the tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{' ': '00',\n",
       " 'i': '010',\n",
       " 'v': '011',\n",
       " 'a': '1000',\n",
       " 'n': '1001',\n",
       " 'd': '1010',\n",
       " 't': '1011',\n",
       " 'l': '110',\n",
       " 'e': '111'}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = generate_code(huff_tree)\n",
    "d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compressing some text using our code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "text1 = \"tell\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1011111110110'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "13"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compress(text1, d)\n",
    "len(compress(text1, d))#4+3+3+3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "28"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unicode_conversion = ascii2bit_stream(\"tell\") #when we convert every character to 7 bits\n",
    "len(unicode_conversion)#4*7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Decoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'00': ' ',\n",
       " '010': 'i',\n",
       " '011': 'v',\n",
       " '1000': 'a',\n",
       " '1001': 'n',\n",
       " '1010': 'd',\n",
       " '1011': 't',\n",
       " '110': 'l',\n",
       " '111': 'e'}"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "e = reverse_dict(d)\n",
    "e"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'tell'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decompress('1011111110110', e)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Handling missing characters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\x00\\x01\\x02\\x03\\x04\\x05\\x06\\x07\\x08\\t\\n\\x0b\\x0c\\r\\x0e\\x0f\\x10\\x11\\x12\\x13\\x14\\x15\\x16\\x17\\x18\\x19\\x1a\\x1b\\x1c\\x1d\\x1e\\x1f !\"#$%&\\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\\x7f'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "asci = str.join(\"\",[chr(c) for c in range(128)])\n",
    "asci"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing the compression ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compression_ratio(text, corpus):\n",
    "    d = generate_code(build_huffman_tree(char_count(corpus)))\n",
    "    len_compress = len(compress(text, d))\n",
    "    len_ascii = len(ascii2bit_stream(text)) #len(text)*7\n",
    "    return len_compress/len_ascii"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.14285714285714285"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.14285714285714285"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compression_ratio(\"ab\", \"ab\")\n",
    "1/7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.2857142857142857"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.2857142857142857"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compression_ratio(\"ab\", \"abcd\")\n",
    "2/7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compression_ratio(\"hello\", \"a\"*100 + asci)\n",
    "8/7"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compression_ratio(\"hello\", \"a\"*10 + asci)\n",
    "compression_ratio(\"hello\", \"a\"*40 + asci)\n",
    "compression_ratio(\"hello\", \"a\"*80 + asci)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happened here? Where is the precise cutoff between a 1 and 8/7 ratio?\n",
    "\n",
    "If we have 65 occurrences of \"a\", then we are left with 127 occurences of characters that are not \"a\".\n",
    "\n",
    "Eventually, all 127 non-\"a\" chars will be grouped in a tree of depth 7. Since the tree is built by pairing items of minimal weight it will not be a full binary tree but slightly skewed - one character (the last character in the dictionary) will have an encoding of length 6 and all others an encoding of length 7 (this can easily be shown by induction).\n",
    "\n",
    "At this point the \"a\" node will be attached to the tree - \"a\" will get an encoding of length 1 and all other characters will get an encoding of length 8 (or 7 in case of the last letter in the dictionary). Since \"a\" is not in \"hello\" the compression ratio will be 8/7.\n",
    "\n",
    "When \"a\" appears less than 65 times the compression ratio depends (among other things) on the order in which the dictionary is accessed. Specifically, in our implementation this will result in a 1.0 compression ratio for \"hello\" (i.e. - all chars in \"hello\" have an encoding of length 7) but not, for example, for \"12345\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_cnt = char_count(\"a\"*64 + asci)\n",
    "d = build_huffman_tree(c_cnt)\n",
    "#d[0]\n",
    "#d[1]\n",
    "code = generate_code(d)\n",
    "#code\n",
    "\n",
    "# Since all chars in \"hello\" have an encoding of length 7 the ratio will be 1.0\n",
    "compression_ratio(\"hello\", \"a\"*63 + asci)\n",
    "\n",
    "# There are however characters with an encoding of length 8 resulting in a 8/7 ratio\n",
    "compression_ratio(\"12345\", \"a\"*63 + asci)\n",
    "\n",
    "# If \"a\" appears more than 64 times all characters (apart from \"a\" and chr(127)) will have an encdoing of length 8\n",
    "compression_ratio(\"hello\", \"a\"*64 + asci)\n",
    "\n",
    "# chr(127) is the only character with an encoding length of 7\n",
    "compression_ratio(chr(127)*5, \"a\"*64 + asci)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "compression_ratio(\"hello\", \"a\"*4 + asci)\n",
    "\n",
    "d = generate_code(build_huffman_tree(char_count(\"a\"*4 + asci)))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'11110'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "[('\\x00', '11111010'),\n",
       " ('\\x01', '11111011'),\n",
       " ('\\x02', '11111100'),\n",
       " ('\\x03', '11111101'),\n",
       " ('\\x04', '11111110'),\n",
       " ('\\x05', '11111111')]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d[\"a\"]\n",
    "\n",
    "[(key,d[key]) for key in d.keys() if len(d[key]) > 7]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'11101001'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "126"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = generate_code(build_huffman_tree(char_count(\"a\"*100 + asci)))\n",
    "d['a']\n",
    "d['h']\n",
    "len([(key,d[key]) for key in d.keys() if len(d[key]) > 7])\n",
    "len([(key,d[key]) for key in d.keys() if 1 < len(d[key]) <= 7])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lempel-Ziv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The LZ compression algorithm works by encoding succinct representations of repetitions in the text. A repetition of the form $[offset,length]$, means that the next $length$ characters appeared $offset$ characters earlier.\n",
    "\n",
    "- It might be that: offset < length\n",
    "- Usually, offset < 4096 and length < 32"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"lz.PNG\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def str_to_ascii(text):\n",
    "    \"\"\" Gets rid of on ascii characters in text\"\"\"\n",
    "    return ''.join(ch for ch in text if ord(ch)<128)\n",
    "\n",
    "\n",
    "\n",
    "def maxmatch(T, p, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\" finds a maximum match of length k<=2**5-1 in a\n",
    "    w long window, T[p:p+k] with T[p-m:p-m+k].\n",
    "    Returns m (offset) and k (match length) \"\"\"\n",
    "    assert isinstance(T,str)\n",
    "    n = len(T)\n",
    "    maxmatch = 0\n",
    "    offset = 0\n",
    "    # Why do we need the min here?\n",
    "    for m in range(1, min(p+1, w)):\n",
    "        k = 0\n",
    "        # Why do we need the min here?\n",
    "        while k < min(n-p, max_length) and T[p-m+k] == T[p+k]:\n",
    "            k += 1\n",
    "        # at this point, T[p-m:p-m+k]==T[p:p+k]\n",
    "        if maxmatch < k:  \n",
    "            maxmatch = k\n",
    "            offset = m\n",
    "    return offset, maxmatch\n",
    "# returned offset is smallest one (closest to p) among\n",
    "# all max matches (m starts at 1)\n",
    "\n",
    "\n",
    "\n",
    "def LZW_compress(text, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\"LZW compression of an ascii text. Produces\n",
    "       a list comprising of either ascii characters\n",
    "       or pairs [m,k] where m is an offset and\n",
    "       k is a match (both are non negative integers) \"\"\"\n",
    "    result = []\n",
    "    n = len(text)\n",
    "    p = 0\n",
    "    while p<n:\n",
    "        m,k = maxmatch(text, p, w, max_length)\n",
    "        if k<2:\n",
    "            result.append(text[p]) #  a single char\n",
    "            p += 1\n",
    "        else:\n",
    "            result.append([m,k])   # two or more chars in match\n",
    "            p += k\n",
    "    return result  # produces a list composed of chars and pairs\n",
    "            \n",
    "                          \n",
    "def LZW_decompress(compressed, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\"LZW decompression from intermediate format to ascii text\"\"\"\n",
    "    result = []\n",
    "    n = len(compressed)\n",
    "    p = 0\n",
    "    while p<n:\n",
    "        if type(compressed[p]) == str:  # char, as opposed to a pair\n",
    "            result.append(compressed[p])\n",
    "            p+=1\n",
    "        else:\n",
    "            m,k = compressed[p]\n",
    "            p += 1\n",
    "            for i in range(0,k):\n",
    "                # append k times to result;  \n",
    "                result.append(result[-m])\n",
    "                # fixed offset m \"to the left\", as result itself grows\n",
    "    return \"\".join(result)\n",
    "\n",
    "\n",
    "\n",
    "def LZW_compress2(text, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\"LZW compression of an ascii text. Produces\n",
    "       a list comprising of either ascii characters\n",
    "       or pairs [m,k] where m is an offset and\n",
    "       k>3 is a match (both are non negative integers) \"\"\"\n",
    "    result = []\n",
    "    n = len(text)\n",
    "    p = 0\n",
    "    while p<n:\n",
    "        m,k = maxmatch(text, p, w, max_length)\n",
    "        if k<3: # modified from k<2\n",
    "            result.append(text[p]) # a single char\n",
    "            p += 1 #even if k was 2 (why?)\n",
    "        else:\n",
    "            result.append([m,k])   # two or more chars in match\n",
    "            p += k\n",
    "    return result  # produces a list composed of chars and pairs\n",
    "\n",
    "          \n",
    "\n",
    "def inter_to_bin(lst, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\" converts intermediate format compressed list\n",
    "       to a string of bits\"\"\"\n",
    "    offset_width = math.ceil(math.log(w,2))\n",
    "    match_width = math.ceil(math.log(max_length, 2))\n",
    "    #print(offset_width,match_width)   # for debugging\n",
    "    result = []\n",
    "    for elem in lst:\n",
    "        if type(elem) == str:\n",
    "            result.append(\"0\")\n",
    "            result.append('{:07b}'.format(ord(elem)))\n",
    "        elif type(elem) == list:\n",
    "            result.append(\"1\")\n",
    "            m,k = elem\n",
    "            result.append(str(bin(m)[2:]).zfill(offset_width))\n",
    "            result.append(str(bin(k)[2:]).zfill(match_width))     \n",
    "    return \"\".join(ch for ch in result)\n",
    "   \n",
    "   \n",
    "                         \n",
    "def bin_to_inter(compressed, w=2**12-1, max_length=2**5-1):\n",
    "    \"\"\" converts a compressed string of bits\n",
    "       to intermediate compressed format \"\"\"\n",
    "    offset_width = math.ceil(math.log(w,2))\n",
    "    match_width = math.ceil(math.log(max_length,2))\n",
    "    #print(offset_width,match_width)   # for debugging\n",
    "    result = []\n",
    "    n = len(compressed)\n",
    "    p = 0\n",
    "    while p<n:\n",
    "        if compressed[p] == \"0\":  # single ascii char\n",
    "            p += 1\n",
    "            char = chr(int(compressed[p:p+7], 2))\n",
    "            result.append(char)\n",
    "            p += 7\n",
    "        elif compressed[p] == \"1\":  # repeat of length > 2\n",
    "            p += 1\n",
    "            m = int(compressed[p:p+offset_width],2)\n",
    "            p += offset_width\n",
    "            k = int(compressed[p:p+match_width],2)\n",
    "            p += match_width\n",
    "            result.append([m,k])\n",
    "    return result\n",
    "\n",
    "\n",
    "#########################\n",
    "### Various executions\n",
    "#########################\n",
    "\n",
    "def process1(text):\n",
    "    \"\"\" packages the whole process using LZ_compress \"\"\"\n",
    "    atext = str_to_ascii(text)\n",
    "    inter = LZW_compress(atext)\n",
    "    binn = inter_to_bin(inter)\n",
    "    inter2 = bin_to_inter(binn)\n",
    "    text2 = LZW_decompress(inter)\n",
    "    return inter, binn, inter2, text2\n",
    "\n",
    "\n",
    "def process2(text):\n",
    "    \"\"\" packages the whole process using LZ_compress2 \"\"\"\n",
    "    atext = str_to_ascii(text)\n",
    "    inter = LZW_compress2(atext)\n",
    "    binn = inter_to_bin(inter)\n",
    "    inter2 = bin_to_inter(binn)\n",
    "    text2 = LZW_decompress(inter2)\n",
    "    return inter, binn, inter2, text2\n",
    "   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### The LZ Encoding Process"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "inter1, bits, inter2, text2 = process2(\"abcdabc\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What is the encoded representation?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['a', 'b', 'c', 'd', [4, 3]]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "'01100001011000100110001101100100100000000010000011'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inter1\n",
    "bits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(bits)#8*4+18"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How do we really encode the chars?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['01100001', '01100010', '01100011', '01100100']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "['1100001', '1100010', '1100011', '1100100']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[bits[8*i:8*(i+1)] for i in range(4)]\n",
    "[bin(ord(c))[2:] for c in \"abcd\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Going back"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'abcdabc'"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['x', 'y', 'z', [3, 3], 'w', [4, 4]]"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inter1, bits, inter2, text2 = process2(\"xyzxyzwxyzw\")\n",
    "inter1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Compression ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "def lz_ratio(text):\n",
    "    inter1, bits, inter2, text2 = process2(text)\n",
    "    return len(bits)/(len(text)*7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.086"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.08317142857142858"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lz_ratio('a' * 1000)\n",
    "lz_ratio('a' * 10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So what happens for increasing $lz\\_ratio('a' \\cdot n)$ as $n \\to \\infty$?\n",
    "\n",
    "Each repetition in the LZ compression requires 18 bits and we need $n/31$ such windows. On the other hand, the regular ASCII representation requires 7 bits per character, thus the ratio approaches $18/(31\\cdot 7)$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.08294930875576037"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "18/(31*7)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1428571428571428"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.8285714285714286"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "text/plain": [
       "0.8285714285714286"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lz_ratio('hello')\n",
    "lz_ratio('hello' * 2)\n",
    "(8*5+18) / (7*10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Comparing Huffman and LZ compression ratios"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "asci = str.join(\"\",[chr(c) for c in range(128)])\n",
    "rand = \"\".join([random.choice(asci) for i in range(1000)])\n",
    "s = \"\".join([chr(ord(\"a\")+i-1)*i for i in range(1,27)])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'abbcccddddeeeeeffffffggggggghhhhhhhhiiiiiiiiijjjjjjjjjjkkkkkkkkkkkllllllllllllmmmmmmmmmmmmmnnnnnnnnnnnnnnoooooooooooooooppppppppppppppppqqqqqqqqqqqqqqqqqrrrrrrrrrrrrrrrrrrsssssssssssssssssssttttttttttttttttttttuuuuuuuuuuuuuuuuuuuuuvvvvvvvvvvvvvvvvvvvvvvwwwwwwwwwwwwwwwwwwwwwwwxxxxxxxxxxxxxxxxxxxxxxxxyyyyyyyyyyyyyyyyyyyyyyyyyzzzzzzzzzzzzzzzzzzzzzzzzzz'"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Huffman compression ratio: 0.9935714285714285\n",
      "LZ      compression ratio: 1.1428571428571428\n",
      "Huffman compression ratio: 0.6866096866096866\n",
      "LZ      compression ratio: 0.2629222629222629\n",
      "Huffman compression ratio: 0.6538857142857143\n",
      "LZ      compression ratio: 0.6195142857142857\n"
     ]
    }
   ],
   "source": [
    "def compare_lz_huffman(text):\n",
    "    print(\"Huffman compression ratio:\", compression_ratio(text,text+asci))\n",
    "    print(\"LZ      compression ratio:\", lz_ratio(text))\n",
    "    \n",
    "compare_lz_huffman(rand)\n",
    "compare_lz_huffman(s)\n",
    "compare_lz_huffman(open(\"looking_glass.txt\", \"r\").read()[:10000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An explanation of the results can be found <a href=\"http://tau-cs1001-py.wdfiles.com/local--files/recitation-logs-2016a/LZ_huffman_comparison.pdf\">here</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}