{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "1b07609a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import math"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e65e7731",
   "metadata": {},
   "source": [
    "# Robert a.k.a. 00011 a.k.a 1100010100000010011\n",
    "> Compression"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f014579",
   "metadata": {},
   "source": [
    "Fun with compressing names which is either interesting to you as a reader or not :) It does connect to data engineering and data science:\n",
    "\n",
    "* Column storage with very similar data, check out Roaring Bitmaps\n",
    "* An old idea to compare music is to compress a song together with a catalogue of music from a genre. The lowest compression is then the genre.\n",
    "* How well data compresses tells you something about the intrisic complexity/richness of the data. Having a large dataset is a very basic measure, compression (and ideas like compression) tells you more.\n",
    "* Encoder/decoders kindof work like lossy compression.\n",
    "\n",
    "This article will be followed up with compressing timeseries."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f7995d9c",
   "metadata": {},
   "source": [
    "## Read census data\n",
    "Read in the 1990 census data of the USA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "bf8859ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_perc(s: str) -> float:\n",
    "    return float(s.strip('%')) / 100\n",
    "\n",
    "def parse_count(s: str) -> int:\n",
    "    return int(s.replace(',', ''))\n",
    "\n",
    "df = pd.read_csv(\"names.csv\", sep=\";\", \n",
    "                 converters={\n",
    "                     'Frequency (%) M': parse_perc,\n",
    "                     'Frequency (%) F': parse_perc,\n",
    "                     'Count M': parse_count,\n",
    "                     'Count F': parse_count\n",
    "                 })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "a10370aa",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Rank</th>\n",
       "      <th>Name M</th>\n",
       "      <th>Frequency (%) M</th>\n",
       "      <th>Count M</th>\n",
       "      <th>Name F</th>\n",
       "      <th>Frequency (%) F</th>\n",
       "      <th>Count F</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>995</th>\n",
       "      <td>996</td>\n",
       "      <td>BURL</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>16235</td>\n",
       "      <td>ROBERT</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>25976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>996</th>\n",
       "      <td>997</td>\n",
       "      <td>WALKER</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>16235</td>\n",
       "      <td>ISABELLA</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>25976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>997</th>\n",
       "      <td>998</td>\n",
       "      <td>TYREE</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>16235</td>\n",
       "      <td>HERMINIA</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>25976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>998</th>\n",
       "      <td>999</td>\n",
       "      <td>JEFFEREY</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>16235</td>\n",
       "      <td>TERRA</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>25976</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>999</th>\n",
       "      <td>1000</td>\n",
       "      <td>AHMED</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>16235</td>\n",
       "      <td>CELINA</td>\n",
       "      <td>0.0001</td>\n",
       "      <td>25976</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Rank    Name M  Frequency (%) M  Count M    Name F  Frequency (%) F  \\\n",
       "995   996      BURL           0.0001    16235    ROBERT           0.0001   \n",
       "996   997    WALKER           0.0001    16235  ISABELLA           0.0001   \n",
       "997   998     TYREE           0.0001    16235  HERMINIA           0.0001   \n",
       "998   999  JEFFEREY           0.0001    16235     TERRA           0.0001   \n",
       "999  1000     AHMED           0.0001    16235    CELINA           0.0001   \n",
       "\n",
       "     Count F  \n",
       "995    25976  \n",
       "996    25976  \n",
       "997    25976  \n",
       "998    25976  \n",
       "999    25976  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2b3d1bc9",
   "metadata": {},
   "source": [
    "## Processing the data to 2-grams"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cea2b5b9",
   "metadata": {},
   "source": [
    "The idea is that we construct character 2-grams and list the top 4 letters following a letter. For example, `Jefferey` would result in `[je, ef, ff, fe, er, re, ey]`. The leter `e` is the start of three 2-grams: `ef`, `er` and `ey`. By summing up the occurences of each 2-gram we can find out what's the most likely character following a specified letter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2a16428",
   "metadata": {},
   "source": [
    "Below we select the top 4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fcf0bb2b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>freq</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>first</th>\n",
       "      <th>second</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"4\" valign=\"top\">A</th>\n",
       "      <th>L</th>\n",
       "      <td>0.0549</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>M</th>\n",
       "      <td>0.0725</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>N</th>\n",
       "      <td>0.1032</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>R</th>\n",
       "      <td>0.1251</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>B</th>\n",
       "      <th>E</th>\n",
       "      <td>0.0482</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                freq\n",
       "first second        \n",
       "A     L       0.0549\n",
       "      M       0.0725\n",
       "      N       0.1032\n",
       "      R       0.1251\n",
       "B     E       0.0482"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Construct 2-grams\n",
    "top = 4\n",
    "\n",
    "def two_grams(s: str):\n",
    "    return list( zip(s[:-1], s[1:]))\n",
    "\n",
    "# Construct flattened list of weighted 2-grams and get the top 8 per letter\n",
    "weighted_two_grams = pd.DataFrame(\n",
    "    df.apply(\n",
    "        lambda row: [(a,b, row['Frequency (%) M']) for a,b in two_grams(row['Name M'])],\n",
    "        axis=1\n",
    "    ).apply(pd.Series).stack().to_list(),\n",
    "    columns=['first', 'second', 'freq']\n",
    ").groupby(by=['first', 'second']).sum().sort_values(by='freq', ascending=False).groupby(by='first').head(top).sort_index()\n",
    "\n",
    "weighted_two_grams.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd4021fe",
   "metadata": {},
   "source": [
    "## Encoding names"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3429963f",
   "metadata": {},
   "source": [
    "To simplify we assume we only have 26 letters. To encode names we go through the name and output a (binary) code for each letter. We distinguish two types of codes:\n",
    "    \n",
    "1. The number of the letter in the alphabet (1-26)\n",
    "2. The position in the top-4 of letters of the letter _preceding_ this one\n",
    "\n",
    "\n",
    "Let's define some functions that do this for us! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "29186958",
   "metadata": {},
   "outputs": [],
   "source": [
    "def encode_char(c, prev):  \n",
    "    if prev is not None and c in weighted_two_grams.loc[(prev,)].index:\n",
    "        return (0, weighted_two_grams.loc[(prev,)].index.get_loc(c))\n",
    "    else:\n",
    "        return (1, ord(c)-65)\n",
    "\n",
    "def encode(name):\n",
    "    encoded = []\n",
    "    prev = None\n",
    "    for c in name:\n",
    "        code = encode_char(c, prev)\n",
    "        prev = c\n",
    "        encoded.append(code)\n",
    "    return encoded\n",
    "\n",
    "def decode_char(code, prev):\n",
    "    t, nr = code\n",
    "    if prev is not None and t == 0:\n",
    "        return weighted_two_grams.loc[(prev,)].iloc[nr].name\n",
    "    else:\n",
    "        return chr(nr+65)\n",
    "\n",
    "def decode(codes):\n",
    "    decoded = []\n",
    "    prev = None\n",
    "    for code in codes:\n",
    "        prev = decode_char(code, prev)\n",
    "        decoded.append(prev)\n",
    "    return ''.join(decoded)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd339894",
   "metadata": {},
   "source": [
    "As an example let's see how `robert` is encoded and decoded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "44b34bf3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 17), (0, 2), (0, 0), (0, 0), (0, 2), (0, 3)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "encode(\"ROBERT\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "031274cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'ROBERT'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decode(encode(\"ROBERT\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad97bc18",
   "metadata": {},
   "source": [
    "## How much bits do we need?\n",
    "\n",
    "We could encode the two types of codes with a `0` followed by 2 bits or `1` followed by 5 bits. The two bits give the position in the top 4, and the 5 bits give the position in the 26 letter alphabet (2^5=32). There might be some room to use even less bits with some trickery.\n",
    "\n",
    "The 2-gram lookup table requires 26*4*5=520 bits: sequentially store the top 4 for a...z, and there is no top four pad with zeros)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c855284",
   "metadata": {},
   "source": [
    "Suppose we encounter the top 1000 names proportionally than on average we need 21.92 bits per name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "8dc17cda",
   "metadata": {},
   "outputs": [],
   "source": [
    "def nr_bits(codes):\n",
    "    return sum([ 1 + (math.log2(top) if t == 0 else 5) for  t, nr in codes])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "95de8b6b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "21.919200000000004"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['enc'] = df['Name M'].map(lambda x: nr_bits(encode(x)))\n",
    "(df['enc'] * df['Frequency (%) M']).sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9dbf8d31",
   "metadata": {},
   "source": [
    "Robert would be encoded as"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "ba982b4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def to_bin(encoding):\n",
    "    return ''.join([str(code) + format(nr, 'b') for code, nr in encoding])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "05eb2780",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1100010100000010011'"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_bin(encode('ROBERT'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ffb2ed0",
   "metadata": {},
   "source": [
    "## Can we do better?\n",
    "If the first 1000 first names really cover 90% we can also encode the first 1000 names in a fixed list. We then encode it as follows:\n",
    "    \n",
    "* Single bit 0 if not in the list and 1 if in the list\n",
    "* If in the list 10 bits a number <1000\n",
    "* If not in the list 5 bits per letter\n",
    "\n",
    "The table takes 5677 * 5 = 28385 bits to store. On average this takes 10 bits per name.\n",
    "\n",
    "The break-even point is at roughly encoding 2300 names from the top 1000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "2a90ff88",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2322.0833333333335"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(28385-520)/12"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a5096ce",
   "metadata": {},
   "source": [
    "## Can we do _even_ better?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e852457",
   "metadata": {},
   "source": [
    "Probably the best idea is to mix the frequency table approach and encoding the top 200 names, as these already cover 72% of all names. The frequency table is very small and also works for names outside of the top 1000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 284,
   "id": "28823ad9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.7261"
      ]
     },
     "execution_count": 284,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values(by='Frequency (%) M', ascending=False).head(200)['Frequency (%) M'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "416c8e7a",
   "metadata": {},
   "source": [
    "This is left as an exercise for the reader."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}