{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Santakku fonts and sign list\n",
    "\n",
    "## Provenance\n",
    "\n",
    "On the advice of Martin Worthington I visited the\n",
    "[Cambridge Cuneify+ page](http://www.hethport.uni-wuerzburg.de/cuneifont/).\n",
    "\n",
    "On that page there is a link to a\n",
    "[Würzburg page on Cuneiform fonts](http://www.hethport.uni-wuerzburg.de/cuneifont/)\n",
    "with a download link to\n",
    "[Old Babylonian Fonts](http://www.hethport.uni-wuerzburg.de/cuneifont/download/Santakku.zip),\n",
    "containing the Santakku(M) fonts and a sign list in PDF.\n",
    "\n",
    "I extracted the text from that PDF, sanitized it to one table cell per line by means of the text editor Vim,\n",
    "and that file is the source of this notebook, that tries to restore the original table in a tab separated format.\n",
    "\n",
    "The PDF is in the *docs* directory of this repo.\n",
    "\n",
    "The sanitized text file is the file *sources/writing/Santakku.txt* in this repository."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problems\n",
    "\n",
    "While the text extraction and sanitizing went reasonably well, there are problems with empty cells.\n",
    "\n",
    "The table is seven columns wide, but there are not seven lines per row in the text file due to missing cells.\n",
    "\n",
    "Yet we can align by means of the typical values in the cells (unicode code points, characters, small numbers).\n",
    "\n",
    "Sometimes the values are also missing.\n",
    "\n",
    "We ignore the values in the Santakku columns and also the value, so we will not suffer much by this problem.\n",
    "\n",
    "## Results\n",
    "\n",
    "We just extract these columns:\n",
    "\n",
    "* `Unicode` i.e. unicode code point,\n",
    "* `signe` i.e. grapheme,\n",
    "* `Autotext` i.e. reading"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [],
   "source": [
    "BASE = os.path.expanduser(\"~/github\")\n",
    "ORG = \"Nino-cunei\"\n",
    "REPO = \"oldbabylonian\"\n",
    "\n",
    "REPO_DIR = f\"{BASE}/{ORG}/{REPO}\"\n",
    "\n",
    "SRC = f\"{REPO_DIR}/sources/writing/Santakku.txt\"\n",
    "\n",
    "CUNEI_START = int(\"12000\", 16)\n",
    "CUNEI_END = int(\"13000\", 16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "uniCandRe = re.compile(r\"\"\"^\\s*([0-9A-Fa-f]{5}[ +]*)+$\"\"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ERROR at line 366: out of sync \"141 12100 GI\"\n",
      "['141 12100 GI', '140 12363', 'ZI', 'mud', 'JM', 'UY', 'MUD (ḪU-ḪI)']\n",
      "ERROR at line 367: missing Borger number \"141 12100 GI\"\n",
      "['141 12100 GI', '140 12363', 'ZI', 'mud', 'JM', 'UY', 'MUD (ḪU-ḪI)']\n",
      "Seen 365 lines\n",
      "ERROR detected\n"
     ]
    }
   ],
   "source": [
    "# code below not working because I do not yet correctly all unicode code point strings\n",
    "# correctly, eg \"140 12363\"\n",
    "\n",
    "\n",
    "def makeMapping():\n",
    "    mapping = {}\n",
    "\n",
    "    def finishUni():\n",
    "        if curGrapheme is None:\n",
    "            print(f'ERROR at line {i + 1}: missing grapheme for uni \"{curUni}\"')\n",
    "            print(list(reversed(prevLines)))\n",
    "            return False\n",
    "\n",
    "        curReading = None\n",
    "        for (p, pLine) in enumerate(reversed(prevLines)):\n",
    "            if p == 0:\n",
    "                if not (pLine.isdigit() and not 0 < int(p) < 1000):\n",
    "                    print(f'ERROR at line {i + 1}: missing Borger number \"{pLine}\"')\n",
    "                    print(list(reversed(prevLines)))\n",
    "                    return False\n",
    "\n",
    "            else:\n",
    "                curReading = line\n",
    "                return True\n",
    "\n",
    "        if curReading is None:\n",
    "            print(f'ERROR at line {i + 1}: missing reading for uni \"{curUni}\"')\n",
    "            print(list(reversed(prevLines)))\n",
    "            return False\n",
    "\n",
    "        uniStrs = curUni.strip().split()\n",
    "        for uniStr in uniStrs:\n",
    "            uniGood = True\n",
    "            try:\n",
    "                int(uniStr, 16)\n",
    "            except Exception:\n",
    "                uniGood = False\n",
    "                break\n",
    "        if not uniGood:\n",
    "            print(f'ERROR at line {i + 1}: malformed unicode number \"{curUni}\"')\n",
    "            print(list(reversed(prevLines)))\n",
    "            return False\n",
    "        unis = {int(uniStr) for uniStr in uniStrs}\n",
    "        if len(unis) != len(uniStrs):\n",
    "            print(f'ERROR at line {i + 1}: identical unis in \"{curUni}\"')\n",
    "            print(list(reversed(prevLines)))\n",
    "            return False\n",
    "\n",
    "        for uniStr in uniStrs:\n",
    "            uniStr = uniStr.upper()\n",
    "            if uniStr in mapping:\n",
    "                print(f'ERROR at line {i + 1}: duplicate uni {uniStr} in \"{curUni}\"')\n",
    "                print(list(reversed(prevLines)))\n",
    "                return False\n",
    "\n",
    "            mapping[uniStr] = (curGrapheme, curReading)\n",
    "        return True\n",
    "\n",
    "    with open(SRC) as fh:\n",
    "        curUni = None\n",
    "        curGrapheme = None\n",
    "        prevLines = []\n",
    "\n",
    "        i = 0\n",
    "        for line in fh:\n",
    "            i += 1\n",
    "            line = line.strip()\n",
    "            if uniCandRe.match(line):\n",
    "                if curUni:\n",
    "                    if not finishUni():\n",
    "                        break\n",
    "                curUni = line\n",
    "                curGrapheme = None\n",
    "                prevLines = []\n",
    "                continue\n",
    "\n",
    "            if len(prevLines) == 0:\n",
    "                curGrapheme = line\n",
    "                prevLines.append(line)\n",
    "                continue\n",
    "\n",
    "            prevLines.append(line)\n",
    "            if len(prevLines) > 6:\n",
    "                print(f'ERROR at line {i + 1}: out of sync \"{line}\"')\n",
    "                print(list(reversed(prevLines)))\n",
    "                break\n",
    "\n",
    "        i += 1\n",
    "        good = finishUni()\n",
    "        print(f\"Seen {i - 1} lines\")\n",
    "\n",
    "    if good:\n",
    "        print(f\"{len(mapping)} unicode characters mapped\")\n",
    "    else:\n",
    "        print(\"ERROR detected\")\n",
    "    return mapping\n",
    "\n",
    "\n",
    "mapping = makeMapping()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "for (uni, (grapheme, reading)) in sorted(mapping.items()):\n",
    "    print(f'\"{chr(uni)}\" = {uni} = \"{grapheme}\" = \"{reading}\"')"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "encoding": "# -*- coding: utf-8 -*-"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}