{ "cells": [ { "cell_type": "code", "execution_count": 12, "id": "1b07609a", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import math" ] }, { "cell_type": "markdown", "id": "e65e7731", "metadata": {}, "source": [ "# Robert a.k.a. 00011 a.k.a 1100010100000010011\n", "> Compression" ] }, { "cell_type": "markdown", "id": "9f014579", "metadata": {}, "source": [ "Fun with compressing names which is either interesting to you as a reader or not :) It does connect to data engineering and data science:\n", "\n", "* Column storage with very similar data, check out Roaring Bitmaps\n", "* An old idea to compare music is to compress a song together with a catalogue of music from a genre. The lowest compression is then the genre.\n", "* How well data compresses tells you something about the intrisic complexity/richness of the data. Having a large dataset is a very basic measure, compression (and ideas like compression) tells you more.\n", "* Encoder/decoders kindof work like lossy compression.\n", "\n", "This article will be followed up with compressing timeseries." ] }, { "cell_type": "markdown", "id": "f7995d9c", "metadata": {}, "source": [ "## Read census data\n", "Read in the 1990 census data of the USA" ] }, { "cell_type": "code", "execution_count": 13, "id": "bf8859ad", "metadata": {}, "outputs": [], "source": [ "def parse_perc(s: str) -> float:\n", " return float(s.strip('%')) / 100\n", "\n", "def parse_count(s: str) -> int:\n", " return int(s.replace(',', ''))\n", "\n", "df = pd.read_csv(\"names.csv\", sep=\";\", \n", " converters={\n", " 'Frequency (%) M': parse_perc,\n", " 'Frequency (%) F': parse_perc,\n", " 'Count M': parse_count,\n", " 'Count F': parse_count\n", " })" ] }, { "cell_type": "code", "execution_count": 14, "id": "a10370aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
RankName MFrequency (%) MCount MName FFrequency (%) FCount F
995996BURL0.000116235ROBERT0.000125976
996997WALKER0.000116235ISABELLA0.000125976
997998TYREE0.000116235HERMINIA0.000125976
998999JEFFEREY0.000116235TERRA0.000125976
9991000AHMED0.000116235CELINA0.000125976
\n", "
" ], "text/plain": [ " Rank Name M Frequency (%) M Count M Name F Frequency (%) F \\\n", "995 996 BURL 0.0001 16235 ROBERT 0.0001 \n", "996 997 WALKER 0.0001 16235 ISABELLA 0.0001 \n", "997 998 TYREE 0.0001 16235 HERMINIA 0.0001 \n", "998 999 JEFFEREY 0.0001 16235 TERRA 0.0001 \n", "999 1000 AHMED 0.0001 16235 CELINA 0.0001 \n", "\n", " Count F \n", "995 25976 \n", "996 25976 \n", "997 25976 \n", "998 25976 \n", "999 25976 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "id": "2b3d1bc9", "metadata": {}, "source": [ "## Processing the data to 2-grams" ] }, { "cell_type": "markdown", "id": "cea2b5b9", "metadata": {}, "source": [ "The idea is that we construct character 2-grams and list the top 4 letters following a letter. For example, `Jefferey` would result in `[je, ef, ff, fe, er, re, ey]`. The leter `e` is the start of three 2-grams: `ef`, `er` and `ey`. By summing up the occurences of each 2-gram we can find out what's the most likely character following a specified letter." ] }, { "cell_type": "markdown", "id": "b2a16428", "metadata": {}, "source": [ "Below we select the top 4:" ] }, { "cell_type": "code", "execution_count": 15, "id": "fcf0bb2b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
freq
firstsecond
AL0.0549
M0.0725
N0.1032
R0.1251
BE0.0482
\n", "
" ], "text/plain": [ " freq\n", "first second \n", "A L 0.0549\n", " M 0.0725\n", " N 0.1032\n", " R 0.1251\n", "B E 0.0482" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Construct 2-grams\n", "top = 4\n", "\n", "def two_grams(s: str):\n", " return list( zip(s[:-1], s[1:]))\n", "\n", "# Construct flattened list of weighted 2-grams and get the top 8 per letter\n", "weighted_two_grams = pd.DataFrame(\n", " df.apply(\n", " lambda row: [(a,b, row['Frequency (%) M']) for a,b in two_grams(row['Name M'])],\n", " axis=1\n", " ).apply(pd.Series).stack().to_list(),\n", " columns=['first', 'second', 'freq']\n", ").groupby(by=['first', 'second']).sum().sort_values(by='freq', ascending=False).groupby(by='first').head(top).sort_index()\n", "\n", "weighted_two_grams.head()" ] }, { "cell_type": "markdown", "id": "fd4021fe", "metadata": {}, "source": [ "## Encoding names" ] }, { "cell_type": "markdown", "id": "3429963f", "metadata": {}, "source": [ "To simplify we assume we only have 26 letters. To encode names we go through the name and output a (binary) code for each letter. We distinguish two types of codes:\n", " \n", "1. The number of the letter in the alphabet (1-26)\n", "2. The position in the top-4 of letters of the letter _preceding_ this one\n", "\n", "\n", "Let's define some functions that do this for us! " ] }, { "cell_type": "code", "execution_count": 16, "id": "29186958", "metadata": {}, "outputs": [], "source": [ "def encode_char(c, prev): \n", " if prev is not None and c in weighted_two_grams.loc[(prev,)].index:\n", " return (0, weighted_two_grams.loc[(prev,)].index.get_loc(c))\n", " else:\n", " return (1, ord(c)-65)\n", "\n", "def encode(name):\n", " encoded = []\n", " prev = None\n", " for c in name:\n", " code = encode_char(c, prev)\n", " prev = c\n", " encoded.append(code)\n", " return encoded\n", "\n", "def decode_char(code, prev):\n", " t, nr = code\n", " if prev is not None and t == 0:\n", " return weighted_two_grams.loc[(prev,)].iloc[nr].name\n", " else:\n", " return chr(nr+65)\n", "\n", "def decode(codes):\n", " decoded = []\n", " prev = None\n", " for code in codes:\n", " prev = decode_char(code, prev)\n", " decoded.append(prev)\n", " return ''.join(decoded)\n" ] }, { "cell_type": "markdown", "id": "fd339894", "metadata": {}, "source": [ "As an example let's see how `robert` is encoded and decoded:" ] }, { "cell_type": "code", "execution_count": 17, "id": "44b34bf3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1, 17), (0, 2), (0, 0), (0, 0), (0, 2), (0, 3)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encode(\"ROBERT\")" ] }, { "cell_type": "code", "execution_count": 18, "id": "031274cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'ROBERT'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decode(encode(\"ROBERT\"))" ] }, { "cell_type": "markdown", "id": "ad97bc18", "metadata": {}, "source": [ "## How much bits do we need?\n", "\n", "We could encode the two types of codes with a `0` followed by 2 bits or `1` followed by 5 bits. The two bits give the position in the top 4, and the 5 bits give the position in the 26 letter alphabet (2^5=32). There might be some room to use even less bits with some trickery.\n", "\n", "The 2-gram lookup table requires 26*4*5=520 bits: sequentially store the top 4 for a...z, and there is no top four pad with zeros)." ] }, { "cell_type": "markdown", "id": "4c855284", "metadata": {}, "source": [ "Suppose we encounter the top 1000 names proportionally than on average we need 21.92 bits per name:" ] }, { "cell_type": "code", "execution_count": 19, "id": "8dc17cda", "metadata": {}, "outputs": [], "source": [ "def nr_bits(codes):\n", " return sum([ 1 + (math.log2(top) if t == 0 else 5) for t, nr in codes])" ] }, { "cell_type": "code", "execution_count": 20, "id": "95de8b6b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "21.919200000000004" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['enc'] = df['Name M'].map(lambda x: nr_bits(encode(x)))\n", "(df['enc'] * df['Frequency (%) M']).sum()" ] }, { "cell_type": "markdown", "id": "9dbf8d31", "metadata": {}, "source": [ "Robert would be encoded as" ] }, { "cell_type": "code", "execution_count": 30, "id": "ba982b4d", "metadata": {}, "outputs": [], "source": [ "def to_bin(encoding):\n", " return ''.join([str(code) + format(nr, 'b') for code, nr in encoding])" ] }, { "cell_type": "code", "execution_count": 31, "id": "05eb2780", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1100010100000010011'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "to_bin(encode('ROBERT'))" ] }, { "cell_type": "markdown", "id": "9ffb2ed0", "metadata": {}, "source": [ "## Can we do better?\n", "If the first 1000 first names really cover 90% we can also encode the first 1000 names in a fixed list. We then encode it as follows:\n", " \n", "* Single bit 0 if not in the list and 1 if in the list\n", "* If in the list 10 bits a number <1000\n", "* If not in the list 5 bits per letter\n", "\n", "The table takes 5677 * 5 = 28385 bits to store. On average this takes 10 bits per name.\n", "\n", "The break-even point is at roughly encoding 2300 names from the top 1000." ] }, { "cell_type": "code", "execution_count": 33, "id": "2a90ff88", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2322.0833333333335" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(28385-520)/12" ] }, { "cell_type": "markdown", "id": "0a5096ce", "metadata": {}, "source": [ "## Can we do _even_ better?\n" ] }, { "cell_type": "markdown", "id": "5e852457", "metadata": {}, "source": [ "Probably the best idea is to mix the frequency table approach and encoding the top 200 names, as these already cover 72% of all names. The frequency table is very small and also works for names outside of the top 1000." ] }, { "cell_type": "code", "execution_count": 284, "id": "28823ad9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7261" ] }, "execution_count": 284, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.sort_values(by='Frequency (%) M', ascending=False).head(200)['Frequency (%) M'].sum()" ] }, { "cell_type": "markdown", "id": "416c8e7a", "metadata": {}, "source": [ "This is left as an exercise for the reader." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }