{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Peek into LAF-Fabric binary\n",
    "\n",
    "We'll open a binary feature file and have a look inside."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ROOT = '/Users/dirk/laf/laf-fabric-data/etcbc4b/bin'\n",
    "FILE = 'Fn0(etcbc4,ft,lex)'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The file is a gzipped file with a serialized Python datastructure in it.\n",
    "The Python way of serializing is called pickling, and the result is a pickled data structure.\n",
    "It can serve many of the purposes that json serves, except that pickled data is binary, so it is not that transparent.\n",
    "Unlike json, pickle has space optimizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pickle, gzip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "with gzip.open('{}/{}'.format(ROOT, FILE), \"rb\") as f:\n",
    "    data = pickle.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There we are, we have a big chunk of data in the variable `data` now."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "type: <class 'dict'>\n"
     ]
    }
   ],
   "source": [
    "print('type: {}'.format(type(data)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So it is a dictionary. Let's examine the keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "426568 keys\n"
     ]
    }
   ],
   "source": [
    "print('{} keys'.format(len(data)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a familiar number, the number of monads (words) in the Hebrew bible.\n",
    "Let's show the first 20 keys and their values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     0: \"B\"\n",
      "     1: \"R>CJT/\"\n",
      "     2: \"BR>[\"\n",
      "     3: \">LHJM/\"\n",
      "     4: \">T\"\n",
      "     5: \"H\"\n",
      "     6: \"CMJM/\"\n",
      "     7: \"W\"\n",
      "     8: \">T\"\n",
      "     9: \"H\"\n",
      "    10: \">RY/\"\n",
      "    11: \"W\"\n",
      "    12: \"H\"\n",
      "    13: \">RY/\"\n",
      "    14: \"HJH[\"\n",
      "    15: \"THW/\"\n",
      "    16: \"W\"\n",
      "    17: \"BHW/\"\n",
      "    18: \"W\"\n",
      "    19: \"XCK/\"\n"
     ]
    }
   ],
   "source": [
    "print('\\n'.join('{:>6}: \"{}\"'.format(*x) for x in sorted(data.items())[0:20]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now without the numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "B\n",
      "R>CJT/\n",
      "BR>[\n",
      ">LHJM/\n",
      ">T\n",
      "H\n",
      "CMJM/\n",
      "W\n",
      ">T\n",
      "H\n",
      ">RY/\n",
      "W\n",
      "H\n",
      ">RY/\n",
      "HJH[\n",
      "THW/\n",
      "W\n",
      "BHW/\n",
      "W\n",
      "XCK/\n"
     ]
    }
   ],
   "source": [
    "print('\\n'.join(x[1] for x in sorted(data.items())[0:20]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "BN/\n",
      "KL/\n",
      "H\n",
      "JWM/\n",
      "B\n",
      "JWM/\n",
      "PC<[\n",
      ">DWM/\n",
      "MN\n",
      "TXT/\n",
      "JD/\n",
      "JHWDH/\n",
      "W\n",
      "MLK[\n",
      "<L\n",
      "MLK/\n",
      "W\n",
      "<BR[\n",
      "JWRM/\n",
      "Y<JR=/\n"
     ]
    }
   ],
   "source": [
    "print('\\n'.join(x[1] for x in sorted(data.items())[200000:200020]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}