{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Marshal as serialization of TF data\n",
    "\n",
    "[Marshal](https://docs.python.org/3/library/marshal.html#module-marshal)\n",
    "is a data serialization format used in the standard library of Python. It is more primitive,\n",
    "but it might be faster.\n",
    "\n",
    "As a simple test, we take the feature data for `g_word_utf8`.\n",
    "It is a map from the numbers 1 to 426584 to Hebrew word occurrences (Unicode strings).\n",
    "\n",
    "In Text-Fabric we have a representation in plain text and a compressed, pickled representation.\n",
    "\n",
    "# Outcome\n",
    "\n",
    "Pickle is faster. Loading gzipped, pickled data is *much* faster than loading gzipped, marshalled data.\n",
    "\n",
    "The size of the marshal uncompressed serialization is much bigger than the TF text representation.\n",
    "\n",
    "The size of the gzipped marshal serialization is approximately the same as the gzipped, pickled TF serialization.\n",
    "\n",
    "# Detailed comparison\n",
    "\n",
    "name | kind | size | load time\n",
    ":--- | :--- | ---: | ---:\n",
    "g_word_utf8.tf | tf: plain unicode text | 5.4 MB | 1.6 s\n",
    "g_word_utf8.tfx | tf: gzipped binary |3.2 MB | 0.2 s\n",
    "g_word_utf8.joblib | marshal: uncompressed | 9.2 MB | 0.8 s\n",
    "g_word_utf8.joblib.gz | marshal: gzipped | 3.3 MB | 3.0 s\n",
    "\n",
    "# Conclusion\n",
    "\n",
    "**We do not see reasons to replace the TF feature data serialization by marshal.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-05-09T13:52:56.726798Z",
     "start_time": "2018-05-09T13:52:56.660991Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import gzip\n",
    "import marshal\n",
    "import pickle\n",
    "\n",
    "from tf.fabric import Fabric\n",
    "\n",
    "GZIP_LEVEL = 2 # same as used in Text-Fabric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load from the textual data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-05-09T13:53:03.953876Z",
     "start_time": "2018-05-09T13:52:57.594126Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "VERSION = 'c'\n",
    "BHSA = f'BHSA/tf/{VERSION}'\n",
    "PARA = f'parallels/tf/{VERSION}'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-05-09T13:53:03.953876Z",
     "start_time": "2018-05-09T13:52:57.594126Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is Text-Fabric 5.5.22\n",
      "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n",
      "Tutorial      : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb\n",
      "Example data  : https://github.com/annotation/text-fabric-data\n",
      "\n",
      "117 features found and 0 ignored\n",
      "  0.00s loading features ...\n",
      "   |     1.60s T g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "  6.02s All features loaded/computed - for details use loadLog()\n"
     ]
    }
   ],
   "source": [
    "TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])\n",
    "api = TF.load('')\n",
    "api.makeAvailableIn(globals())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The load time is **~ 1.6 seconds**.\n",
    "\n",
    "But during this time, the textual data has been compiled and written to a binary form.\n",
    "Let's load again.\n",
    "\n",
    "## Load from binary data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is Text-Fabric 5.5.22\n",
      "Api reference : https://annotation.github.io/text-fabric/Api/Fabric/\n",
      "Tutorial      : https://github.com/annotation/text-fabric/blob/master/docs/tutorial.ipynb\n",
      "Example data  : https://github.com/annotation/text-fabric-data\n",
      "\n",
      "117 features found and 0 ignored\n",
      "  0.00s loading features ...\n",
      "  4.65s All features loaded/computed - for details use loadLog()\n"
     ]
    }
   ],
   "source": [
    "TF = Fabric(locations='~/github/etcbc', modules=[BHSA, PARA])\n",
    "api = TF.load('')\n",
    "api.makeAvailableIn(globals())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   |     0.03s B otype                from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.53s B oslots               from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.01s B book                 from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.01s B chapter              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.01s B verse                from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.13s B g_cons               from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.18s B g_cons_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.14s B g_lex                from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.25s B g_lex_utf8           from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.22s B g_word               from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.26s B g_word_utf8          from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.14s B lex0                 from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.17s B lex_utf8             from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B qere                 from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B qere_trailer         from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B qere_trailer_utf8    from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B qere_utf8            from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.07s B trailer              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.08s B trailer_utf8         from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B __levels__           from otype, oslots, otext\n",
      "   |     0.03s B __order__            from otype, oslots, __levels__\n",
      "   |     0.03s B __rank__             from otype, __order__\n",
      "   |     1.11s B __levUp__            from otype, oslots, __rank__\n",
      "   |     0.87s B __levDown__          from otype, __levUp__, __rank__\n",
      "   |     0.33s B __boundary__         from otype, oslots, __rank__\n",
      "   |     0.01s B __sections__         from otype, oslots, otext, __levUp__, __levels__, book, chapter, verse\n",
      "   |     0.00s B book@am              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@ar              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@bn              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@da              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@de              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@el              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@en              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@es              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@fa              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@fr              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@he              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@hi              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@id              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@ja              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@ko              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@la              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@nl              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@pa              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@pt              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@ru              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@sw              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@syc             from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@tr              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@ur              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@yo              from /Users/dirk/github/etcbc/BHSA/tf/c\n",
      "   |     0.00s B book@zh              from /Users/dirk/github/etcbc/BHSA/tf/c\n"
     ]
    }
   ],
   "source": [
    "loadLog()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The load time of the feature `g_word_utf8` is **~ 0.2 seconds**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Make an marshal feature data file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "tempDir = os.path.expanduser('~/github/annotation/text-fabric/_temp/marshal')\n",
    "os.makedirs(tempDir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "426584\n",
      "רֵאשִׁ֖ית\n"
     ]
    }
   ],
   "source": [
    "feature = 'g_word_utf8'\n",
    "data =  TF.features[feature].data\n",
    "print(len(data))\n",
    "print(data[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We write the feature data to an Avro data file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataFile = f'{tempDir}/{feature}.marshal'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start writing\n",
      "  0.09s done\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start writing')\n",
    "with open(dataFile, 'wb') as mf:\n",
    "  marshal.dump(data, mf)\n",
    "info('done')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We make also a gzipped data file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start writing\n",
      "  0.26s done\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start writing')\n",
    "dataFileZ = f'{dataFile}.gz'\n",
    "with gzip.open(dataFileZ, 'wb', compresslevel=GZIP_LEVEL) as mf:\n",
    "  marshal.dump(data, mf)\n",
    "info('done')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load from marshal file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start reading\n",
      "  0.83s done\n",
      "רֵאשִׁ֖ית\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start reading')\n",
    "with open(dataFile, 'rb') as mf:\n",
    "  rData = marshal.load(mf)\n",
    "info('done')\n",
    "print(rData[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load time **~ 0.8 seconds**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load from marshal file (gzipped)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start reading\n",
      "  3.04s done\n",
      "רֵאשִׁ֖ית\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start reading')\n",
    "with gzip.open(dataFileZ, 'rb') as mf:\n",
    "  rData = marshal.load(mf)\n",
    "info('done')\n",
    "print(rData[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load time **~ 3.0 seconds**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start reading\n",
      "  3.05s done\n",
      "רֵאשִׁ֖ית\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start reading')\n",
    "with gzip.open(dataFileZ, 'rb') as mf:\n",
    "  rData = marshal.load(mf)\n",
    "info('done')\n",
    "print(rData[2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Direct comparison with pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "tfDataFileZ = os.path.expanduser('~/github/etcbc/bhsa/tf/c/.tf/g_word_utf8.tfx')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  0.00s start reading\n",
      "  0.27s done\n",
      "רֵאשִׁ֖ית\n"
     ]
    }
   ],
   "source": [
    "indent(reset=True)\n",
    "info('start reading')\n",
    "with gzip.open(tfDataFileZ, 'rb') as mf:\n",
    "  rData = pickle.load(mf)\n",
    "info('done')\n",
    "print(rData[2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}