{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Manual implementation of some hash functions\n", "\n", "This small [Jupyter notebook](https://www.Jupyter.org/) is a short experiment, to see if I can implement the some basic [Hashing functions](https://en.wikipedia.org/wiki/Hash_function), more specifically [cryptographic hashing functions](https://en.wikipedia.org/wiki/Cryptographic_hash_function), like `MD5`, `SHA1`, `SHA256`, `SHA512` etc\n", "\n", "And then I want compare my manual implementations with the functions implemented in [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html).\n", "Ideally, my implementation should work exactly like the reference one, only slower!\n", "\n", "\n", "- *Reference*: Wikipedia pages on [Hash functions](https://en.wikipedia.org/wiki/Hash_function), [MD5](https://en.wikipedia.org/wiki/MD5), [SHA-1](https://en.wikipedia.org/wiki/SHA1) and [SHA-2](https://en.wikipedia.org/wiki/SHA-2), as well as [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html).\n", "- *Date*: 13 May 2017 (first part about MD5), 19 June 2017 (second part about SHA-1), 19 June 2017 (last part about SHA-2).\n", "- *Author*: [Lilian Besson](https://GitHub.com/Naereen/notebooks).\n", "- *License*: [MIT Licensed](https://LBesson.MIT-License.org/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## What is a hash function?\n", "> TL;DR : [Hash functions](https://en.wikipedia.org/wiki/Hash_function) and [cryptographic hashing functions](https://en.wikipedia.org/wiki/Cryptographic_hash_function) on Wikipedia." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Common API for the different classes\n", "\n", "I will copy the API proposed by [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html), so it will be very easy to compare my implementations with the one provided with your default [Python](https://www.Python.org/) installation." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class Hash(object):\n", " \"\"\" Common class for all hash methods.\n", " \n", " It copies the one of the hashlib module (https://docs.python.org/3.5/library/hashlib.html).\n", " \"\"\"\n", " \n", " def __init__(self, *args, **kwargs):\n", " \"\"\" Create the Hash object.\"\"\"\n", " self.name = self.__class__.__name__ # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.name\n", " self.byteorder = 'little'\n", " self.digest_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.digest_size\n", " self.block_size = 0 # https://docs.python.org/3.5/library/hashlib.html#hashlib.hash.block_size\n", "\n", " def __str__(self):\n", " return self.name\n", " \n", " def update(self, arg):\n", " \"\"\" Update the hash object with the object arg, which must be interpretable as a buffer of bytes.\"\"\"\n", " pass\n", "\n", " def digest(self):\n", " \"\"\" Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255.\"\"\"\n", " return b\"\"\n", "\n", " def hexdigest(self):\n", " \"\"\" Like digest() except the digest is returned as a string object of double length, containing only hexadecimal digits. This may be used to exchange the value safely in email or other non-binary environments.\"\"\"\n", " digest = self.digest()\n", " raw = digest.to_bytes(self.digest_size, byteorder=self.byteorder)\n", " format_str = '{:0' + str(2 * self.digest_size) + 'x}'\n", " return format_str.format(int.from_bytes(raw, byteorder='big'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## Checking the [the `hashlib` module in Python standard library](https://docs.python.org/3/library/hashlib.html)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import hashlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check [the available algorithms](https://docs.python.org/3.5/library/hashlib.html#hashlib.algorithms_available), some of them being [guaranteed to be on any platform](https://docs.python.org/3.5/library/hashlib.html#hashlib.algorithms_guaranteed), some are not." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['ripemd160',\n", " 'MD4',\n", " 'SHA224',\n", " 'sha384',\n", " 'md4',\n", " 'DSA-SHA',\n", " 'SHA256',\n", " 'SHA384',\n", " 'MD5',\n", " 'whirlpool',\n", " 'sha',\n", " 'sha1',\n", " 'RIPEMD160',\n", " 'DSA',\n", " 'dsaWithSHA',\n", " 'dsaEncryption',\n", " 'sha224',\n", " 'SHA',\n", " 'sha512',\n", " 'md5',\n", " 'SHA512',\n", " 'ecdsa-with-SHA1',\n", " 'SHA1',\n", " 'sha256']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(hashlib.algorithms_available)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I will need at least these ones:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "assert 'MD5' in hashlib.algorithms_available\n", "assert 'SHA1' in hashlib.algorithms_available\n", "assert 'SHA256' in hashlib.algorithms_available\n", "assert 'SHA224' in hashlib.algorithms_available\n", "assert 'SHA512' in hashlib.algorithms_available\n", "assert 'SHA384' in hashlib.algorithms_available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets check that they have the block size and digest size announced:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "For MD5 : the block size is 64 and the digest size is 16.\n", "For SHA1 : the block size is 64 and the digest size is 20.\n", "For SHA256 : the block size is 64 and the digest size is 32.\n", "For SHA224 : the block size is 64 and the digest size is 28.\n", "For SHA512 : the block size is 128 and the digest size is 64.\n", "For SHA384 : the block size is 128 and the digest size is 48.\n" ] } ], "source": [ "for name, s in zip(\n", " ('MD5', 'SHA1', 'SHA256', 'SHA224', 'SHA512', 'SHA384'),\n", " (hashlib.md5(), hashlib.sha1(), hashlib.sha256(), hashlib.sha224(), hashlib.sha512(), hashlib.sha384())\n", " ):\n", " print(\"For {:<8} : the block size is {:<3} and the digest size is {:<2}.\".format(name, s.block_size, s.digest_size))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## First stupid example: a stupid hashing function\n", "\n", "This \"stupid\" hashing function will use `digest_size` of 128 bytes (= 16 bits), and compute it by ... just looking at the first 128 bytes of the input data.\n", "\n", "This is just to check the API and how to read from a bytes buffer." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class HeaderHash(Hash):\n", " \"\"\" This \"stupid\" hashing function will use `digest_size` of 128 bytes, and compute it by ... just looking at the first 128 bytes of the input data.\n", " \"\"\"\n", " \n", " def __init__(self):\n", " # Common part\n", " self.digest_size = 16\n", " self.block_size = 16\n", " self.name = \"Header\"\n", " # Specific part\n", " self._data = b\"\"\n", "\n", " def update(self, arg):\n", " \"\"\" Update the hash object with the object arg, which must be interpretable as a buffer of bytes.\"\"\"\n", " if len(self._data) == 0:\n", " self._data = arg[:self.block_size]\n", "\n", " def digest(self):\n", " \"\"\" Return the digest of the data passed to the update() method so far. This is a bytes object of size digest_size which may contain bytes in the whole range from 0 to 255.\"\"\"\n", " return self._data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us try it:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "h1 = HeaderHash()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<__main__.HeaderHash at 0x7fc84c200630>" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ "Header\n" ] } ], "source": [ "h1\n", "print(h1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us use some toy data, to test here and after." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data = b\"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\" * 100" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "b'0123456789ABCDEF'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h1.update(data)\n", "h1.digest()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Well... It seems to work, even if this first example is stupid." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## First real example: the MD5 hashing function\n", "Let start with a simple one: [the MD5 hashing function](https://en.wikipedia.org/wiki/MD5), from Rivest in 1992.\n", "\n", "*Note:* [This page helped for debugging](https://rosettacode.org/wiki/MD5/Implementation#Python).