{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "# In memory ZIP\n", "\n", "I recently ran into an issue with a large amount of JSON events, where the volume of data was ridicoulus simply because of the redundancy in the dataset.\n", "\n", "Hearing Raymond Hettingers words inside my head: \"There Must Be a Better Way!\" I decided to spend some time shrinking the data.\n", "\n", "There were 3 steps to the process:\n", "\n", "1. First convert redundant JSON, such as `{'x': ..., 'y': ..., 'z':...}` to tuples.\n", "2. Second to do a diff between the each objects state and only keep the changes.\n", "3. Compress the JSON as quickly as possible.\n", "\n", "The two first parts are more of less trivial. The JSON is a dict in python so doing a diff is a dict comparison\n" ] }, { "cell_type": "code", "source": [ "def dict_comp(A, B):\n", " \"\"\" helper for comparing json like dicts\"\"\"\n", " C = {}\n", " for k, v1 in A.items():\n", " v2 = B.get(k, None)\n", " if v2 is None:\n", " C[k]=v1\n", " elif v1!=v2:\n", " C[k]=v2\n", " elif isinstance(v1,dict):\n", " C.update(dict_comp(v1, v2))\n", " else:\n", " continue\n", " return C" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "execution_count": 1, "outputs": [] }, { "cell_type": "code", "execution_count": 2, "outputs": [ { "data": { "text/plain": "{'Alice': 0, 'Bob': {'one': 1, 'two': 0}}" }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "A = {'Alice': 1, 'Bob': {'one': 1, 'two':2}}\n", "B = {'Alice': 0, 'Bob': {'one': 1, 'two': 0}}\n", "\n", "dict_comp(A,B)" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Here we only see the changes.\n", "\n", "The third part is probably a little more novel. Let's start with the requirements:\n", "\n", "1. To have a class with a simple api.\n", "2. To keep everything in memory (speed)\n", "3. To be able to append more data to the ZIP\n", "4. To be able to write the zip from memory to disk as a single linear write.\n", "5. To be able to lazily load the zip.\n", "6. To be able to iterate over the filenames in the zip\n", "7. To be able to load the data from the zip using the name as a path.\n", "\n", "Here's the whole thing:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 3, "outputs": [], "source": [ "import zipfile, io\n", "\n", "\n", "class InMemoryZip(object):\n", " def __init__(self):\n", " # Create the in-memory file-like object\n", " self.in_memory_zip = io.BytesIO()\n", " self._path = None\n", "\n", " def load(self, path):\n", " if not isinstance(path, pathlib.Path):\n", " raise TypeError(f\"{path} is not a path object\")\n", " if not path.name.lower().endswith('zip'):\n", " raise ValueError(f\"{path} is not a zip\")\n", "\n", " self._path = path\n", "\n", " def __iter__(self):\n", " assert isinstance(self._path, pathlib.Path)\n", " zf = zipfile.ZipFile(self._path)\n", " for name in zf.namelist():\n", " yield name\n", "\n", " def __getitem__(self, item):\n", " zf = zipfile.ZipFile(self._path)\n", " if item not in zf.namelist():\n", " raise KeyError(f\"no such file: {item}\")\n", " return zf.read(item)\n", "\n", " def append(self, filename_in_zip, file_contents):\n", " \"\"\"Appends a file with name filename_in_zip and contents of\n", " file_contents to the in-memory zip.\"\"\"\n", " # Get a handle to the in-memory zip in append mode\n", " zf = zipfile.ZipFile(self.in_memory_zip, \"a\", zipfile.ZIP_DEFLATED, False)\n", "\n", " # Write the file to the in-memory zip\n", " zf.writestr(filename_in_zip, file_contents)\n", "\n", " # Mark the files as having been created on Windows so that\n", " # Unix permissions are not inferred as 0000\n", " for zfile in zf.filelist:\n", " zfile.create_system = 0\n", "\n", " return self\n", "\n", " def read(self):\n", " \"\"\"Returns a string with the contents of the in-memory zip.\"\"\"\n", " self.in_memory_zip.seek(0)\n", " return self.in_memory_zip.read()\n", "\n", " def write(self, path):\n", " \"\"\"Writes the in-memory zip to a file.\"\"\"\n", " if not isinstance(path, pathlib.Path):\n", " raise TypeError\n", " with path.open('wb') as fo:\n", " fo.write(self.read())\n" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "To test that it works, let's first get the imports out of the way:" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 4, "outputs": [], "source": [ "import io\n", "import pathlib\n", "import tempfile" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Next, let's create some data and store it in memory" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 5, "outputs": [ { "data": { "text/plain": "<__main__.InMemoryZip at 0x28aa67d12b0>" }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imz = InMemoryZip()\n", "bytestream = io.BytesIO(b\"123 123 \")\n", "bytestream.seek(0)\n", "imz.append('a/first', bytestream.read())\n", "\n", "bytestream = io.BytesIO(b\"123 456 \")\n", "bytestream.seek(0)\n", "imz.append('a/second', bytestream.read())" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Now let's write it to disk" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 6, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "io_test.zip 222 bytes\n" ] } ], "source": [ "\n", "tempdir = tempfile.gettempdir()\n", "path = pathlib.Path(tempdir) / \"io_test.zip\"\n", "imz.write(path)\n", "with path.open('rb') as fi:\n", " print(path.name, len(fi.read()), \"bytes\")" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Finally let's load it from disk" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 7, "outputs": [], "source": [ "\n", "imz = InMemoryZip()\n", "imz.load(path)\n", "\n", "names = [name for name in imz]\n", "assert len(names) == 2\n", "first = imz['a/first']\n", "assert first == b\"123 123 \"\n", "second = imz['a/second']\n", "assert second == b\"123 456 \"" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "And finally clean up the file system." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } }, { "cell_type": "code", "execution_count": 8, "outputs": [], "source": [ "path.unlink()" ], "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } } }, { "cell_type": "markdown", "source": [ "Simple." ], "metadata": { "collapsed": false, "pycharm": { "name": "#%% md\n" } } } ], "metadata": { "kernelspec": { "name": "python3", "language": "python", "display_name": "Python 3 (ipykernel)" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }