{ "metadata": { "name": "fuzzy_hash_micro" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Demo POC that fuzzy-hashes two files and compares them.\n", "\n", "The first part uses Python sets to compare the files for similarity.\n", "Set objects grow large when thousands of sets are stored.\n", "\n", "The second part stores the data in a HyperLogLog data structure.\n", "\n", "The third part will use Redis for persistent storage.\n", "WIP: Redis integration and a Jaccard-based similarity index\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Setup of fuzzy hasher" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import time\n", "import struct\n", "import hashlib" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Open the two files as file_dump and file_dump2" ] }, { "cell_type": "code", "collapsed": false, "input": [ "filename = \"test.pdf\"\n", "filename2 = \"Python_for_Data_Analysis.pdf\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "file_dump = open(filename, \"rb\")\n", "file_dump2 = open(filename2, \"rb\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "file_dump" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 7, "text": [ "" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "file_dump2" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 8, "text": [ "" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Here I build a byte reader function that buffers reads number_bytes bytes at a time" ] }, { "cell_type": "code", 
"collapsed": false, "input": [ "def byte_reader(memory_dump, number_bytes):\n", " '''\n", " Read the #number_byte of bytes\n", " '''\n", " byte = memory_dump.read(number_bytes)\n", " return byte" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "byte_reader(file_dump, 32)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 10, "text": [ "'%PDF-1.6\\r%\\xe2\\xe3\\xcf\\xd3\\r\\n10792 0 obj\\r< 0:\n", " raise ValueError(\"maxCardinality must be > 0\")\n", "\n", " self._maxCardinality = maxCardinality\n", " self._k = int(round(log(pow(1.05/error_rate,2),2)))\n", " self._bucketNumber = 1< 0:\n", " raise ValueError(\"maxCardinality must be > 0\")\n", "\n", " self._maxCardinality = maxCardinality\n", " self._k = int(round(log(pow(1.30/error_rate,2),2)))\n", " self._bucketNumber = 1< 0:\n", " raise ValueError(\"maxCardinality must be > 0\")\n", "\n", " self._maxCardinality = maxCardinality\n", " self._k = int(round(log(pow(1.04/error_rate,2),2)))\n", " self._bucketNumber = 1<>1) + str(number&1)\n", "\n", "\n", "\n", "######################################\n", "#\n", "# make a bunch of hyperloglog sketches\n", "#\n", "######################################\n", "a = HyperLogLogSketch(2000000,0.05)\n", "b = HyperLogLogSketch(2000000,0.05)\n", "c = HyperLogLogSketch(2000000,0.05)\n", " \n", "for i in xrange(100000):\n", " a.add(str(i))\n", "for i in xrange(1500):\n", " b.add(str(i))\n", "for i in xrange(100000,200000):\n", " c.add(str(i))\n", " \n", "#print sys.getsizeof(a)\n", "print \"1-100,000 random items put in set - Estimated count: \", a.getNumberEstimate()\n", "print \"1500 random items put in set - Estimated count: \", b.getNumberEstimate()\n", "print \"100,000-200,000 random items put in set - Estimated count: \", c.getNumberEstimate()\n", "print \"Making a joined set with items numbered 1-100k and 100k-200k\", c.join(a,a)\n", "print \"Here 
is the joined count: \", c.getNumberEstimate()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1-100,000 random items put in set - Estimated count: 99402.5499907\n", "1500 random items put in set - Estimated count: 1530.86693324\n", "100,000-200,000 random items put in set - Estimated count: 109661.677392\n", "Making a joined set with items numbered 1-100k and 100k-200k None\n", "Here is the joined count: 208895.418605\n" ] } ], "prompt_number": 24 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Thousands of hashes stored in a HyperLogLog sketch" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "Demo: hash up to 1,000,000 fixed-size blocks of a file (read with a size argument of 16) into a HyperLogLog sketch.\n", "\"\"\"\n", "\n", "hash_store = HyperLogLogSketch(2000000,0.05)\n", "\n", "fd = open(\"/Users/antigen/Downloads/Python_for_Data_Analysis.pdf\", \"rb\")\n", "i=0\n", "\n", "for element in xrange(0,1000000):\n", " buffer, hash_result = hashing_byte_reader(fd, 16)\n", " hash_store.add(hash_result)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 }, { "cell_type": "code", "collapsed": false, "input": [ "hash_store.getNumberEstimate()" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "pyout", "prompt_number": 26, "text": [ "826962.62820048362" ] } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO: figure out the intersection of two hash_stores" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 50 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }