{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Pseudo-random number generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Properties of PRNGs\n", "\n", "+ dimension of output\n", " - commonly 32 bits, but some have more\n", " \n", "+ number of states\n", " - dimension of state space in bits\n", " - sometimes state = output, but better generators generally have output = f(state)\n", "\n", "+ period\n", " - maximum over initial states of the number of states visited before repeating\n", " - period ≤ number of states\n", " - if state has $s$ bits, period $\\le 2^s$\n", " - for some PRNGs, period is much less than number of states\n", " - for some seeds for some PRNGs, number of states visited is much less than period\n", " \n", "+ $k$-distribution\n", " - suppose $\\{X_i\\}$ is sequence of $P$ $w$-bit integers\n", " - define $t_v(X_i)$ to be the first $v$ bits of $X_i$\n", " - $\\{X_i\\}$ is $k$-distributed to $v$-bit accuracy if each of the $2^{kv}-1$ possible nonzero $kv$-bit vectors occurs equally often among the $P$ $kv$-bit vectors\n", "$$ (t_v(X_i),\\,t_v(X_{i+1}), \\ldots ,t_v(X_{i+k-1}))\\quad (0\\le i" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# plot the triples as points in R^3\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "ax.scatter(xs[0],xs[1], xs[2])\n", "\n", "plt.rcParams['figure.figsize'] = (18.0, 18.0) \n", "\n", "ax.view_init(-100,110)\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wichmann-Hill (1982)\n", "\n", "Sum of 3 LCGs. Period is 6,953,607,871,644. \n", "\n", " def WH(s1, s2, s3):\n", " s1 = (171 * s1) % 30269\n", " s2 = (172 * s2) % 30307\n", " s3 = (170 * s3) % 30323\n", " r = (s1/30269 + s2/30307 + s3/30323) % 1\n", " return [r, s1, s2, s3]\n", "\n", "#### The right way, the wrong way, and the Microsoft way.\n", "WH generally not considered adequate for statistics, but was (nominally) the PRNG in Excel for several\n", "generations. Excel did not allow the seed to be set, so analyses were not reproducible. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![mcCullough](./RandPics/notWichmannHill08.png)\n", "McCullough, B.D., 2008. Microsoft Excel's 'Not The Wichmann–Hill' random number generators\n", "_Computational Statistics & Data Analysis_, _52_, 4587–4593\n", "doi:10.1016/j.csda.2008.03.006" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mersenne Twister (MT) Matsumoto & Nishimura (1997)\n", "\n", "+ example of \"twisted generalized feedback shift register\"\n", "+ period $2^{19937}-1$, a Mersenne Prime\n", "+ $k$-distributed to 32-bit accuracy for all $k \\in \\{1, \\ldots, 623\\}$. \n", "+ passes DIEHARD and most of TestU01 (see below)\n", "+ standard in many packages:\n", " - GNU Octave, Maple, MATLAB, Mathematica, Python, R, Stata\n", " - Apache, CMU Common Lisp, Embeddable Common Lisp, Free Pascal, GLib, PHP, GAUSS, IDL, Julia, Ruby, SageMath, Steel Bank Common Lisp, Scilab, Stata, GNU Scientific Library, GNU Multiple Precision Arithmetic Library, Microsoft Visual C++.\n", " - SPSS and SAS offer MT, as does C++ (v11 and up)\n", "+ generally considered adequate for statistics (but not for cryptography); however, will trouble that in this work, esp. for \"big data\"\n", "+ usual implementation has 624-dimensional state space, but TinyMT uses only 127 bits\n", "+ seeding complicated, since state is an array\n", "+ can take a while to \"burn in,\" especially for seeds with many zeros\n", "+ output for close seed states can be close\n", "+ 2002 update improves seeding\n", "+ completely predictable from 624 successive outputs\n", "+ problems discovered in 2007 (see TestU01, below)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Python implementation of MT19937 from Wikipedia \n", "# https://en.wikipedia.org/wiki/Mersenne_Twister#Python_implementation\n", "\n", "def _int32(x):\n", " # Get the 32 least significant bits.\n", " return int(0xFFFFFFFF & x)\n", "\n", "class MT19937:\n", "\n", " def __init__(self, seed):\n", " # Initialize the index to 0\n", " self.index = 624\n", " self.mt = [0] * 624\n", " self.mt[0] = seed # Initialize the initial state to the seed\n", " for i in range(1, 624):\n", " self.mt[i] = _int32(\n", " 1812433253 * (self.mt[i - 1] ^ self.mt[i - 1] >> 30) + i)\n", "\n", " def extract_number(self):\n", " if self.index >= 624:\n", " self.twist()\n", "\n", " y = self.mt[self.index]\n", "\n", " # Right shift by 11 bits\n", " y = y ^ y >> 11\n", " # Shift y left by 7 and take the bitwise and of 2636928640\n", " y = y ^ y << 7 & 2636928640\n", " # Shift y left by 15 and take the bitwise and of y and 4022730752\n", " y = y ^ y << 15 & 4022730752\n", " # Right shift by 18 bits\n", " y = y ^ y >> 18\n", "\n", " self.index = self.index + 1\n", "\n", " return _int32(y)\n", "\n", " def twist(self):\n", " for i in range(624):\n", " # Get the most significant bit and add it to the less significant\n", " # bits of the next number\n", " y = _int32((self.mt[i] & 0x80000000) +\n", " (self.mt[(i + 1) % 624] & 0x7fffffff))\n", " self.mt[i] = self.mt[(i + 397) % 624] ^ y >> 1\n", "\n", " if y % 2 != 0:\n", " self.mt[i] = self.mt[i] ^ 0x9908b0df\n", " self.index = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### xorshift family\n", "\n", "Originated by Marsaglia, 2003.\n", "\n", "Vigna, S., 2014. Further scramblings of Marsaglia's xorshift generators. https://arxiv.org/abs/1404.0390\n", "\n", "128-bit xorshift+ Implemented in Python package randomstate https://pypi.python.org/pypi/randomstate/1.10.1\n", "\n", " uint64_t s[2];\n", "\n", " uint64_t xorshift128plus(void) {\n", "\t uint64_t x = s[0];\n", "\t uint64_t const y = s[1];\n", "\t s[0] = y;\n", "\t x ^= x << 23; // a\n", "\t s[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c\n", "\t return s[1] + y;\n", " }\n", " \n", " \n", "1024-bit xorshift+\n", "\n", " uint64_t s[16];\n", " int p;\n", " \n", " uint64_t next(void) {\n", " const uint64_t s0 = s[p];\n", " uint64_t s1 = s[p = (p + 1) & 15];\n", " const uint64_t result = s0 + s1;\n", " s1 ^= s1 << 31; // a\n", " s[p] = s1 ^ s0 ^ (s1 >> 11) ^ (s0 >> 30); // b, c\n", " return result;\n", "}\n", "\n", "xorshift+ passes all the tests in BigCrush, has 128-bit state space and period $2^{128}-1$, but is \n", "only $(k-1)$-dimensionally equidistributed, where $k$ is the dimension of the distribution of the xorshift\n", "generator from which it's derived. E.g., for the 128-bit version, xorshift+ is only 1-dimensionally equidistributed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other non-cryptographic PRNGs\n", "\n", "See http://www.pcg-random.org/ and the talk http://www.pcg-random.org/posts/stanford-colloquium-talk.html\n", "\n", "PCG family permutes the output of a LCG; good statistical properties and very fast and compact. Related to Rivest's RC5 cipher.\n", "\n", "Seems better than MT, xorshift+, et al. \n", "\n", " // *Really* minimal PCG32 code / (c) 2014 M.E. O'Neill / pcg-random.org\n", " // Licensed under Apache License 2.0 (NO WARRANTY, etc. see website)\n", "\n", " typedef struct { uint64_t state; uint64_t inc; } pcg32_random_t;\n", "\n", " uint32_t pcg32_random_r(pcg32_random_t* rng)\n", " {\n", " uint64_t oldstate = rng->state;\n", " // Advance internal state\n", " rng->state = oldstate * 6364136223846793005ULL + (rng->inc|1);\n", " // Calculate output function (XSH RR), uses old state for max ILP\n", " uint32_t xorshifted = ((oldstate >> 18u) ^ oldstate) >> 27u;\n", " uint32_t rot = oldstate >> 59u;\n", " return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PRNGs based on cryptographic hash functions\n", "\n", "Cryptographic hash functions have several basic properties:\n", "\n", "1. produce fixed-length \"digest\" of an arbitrarily long \"message\": $H:\\{0, 1\\}^* \\rightarrow \\{0, 1\\}^L$.\n", "1. inexpensive to compute\n", "1. non-invertible (\"one-way,\" hard to find pre-image of any hash except by exhaustive enumeration)\n", "1. collision-resistant (hard to find $M_1 \\ne M_2$ such that $H(M_1) = H(M_2)$)\n", "1. small change to input produces big change to output (\"unpredictable,\" input and output effectively independent)\n", "1. equidistributed: bits of the hash are essentially random \n", "\n", "Summary: _as if_ $H(M)$ is random $L$-bit string is assigned to $M$ in a way that's essentially unique." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 step of SHA-256\n", "By User:kockmeyer - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1823488\n", "\n", "![SHA-2](./RandPics/SHA-2.svg)\n", "\n", "$$ \\mbox{Ch} (E,F,G) \\equiv (E\\land F)\\oplus (\\neg E\\land G) $$\n", "$$ \\mbox{Ma} (A,B,C) \\equiv (A\\land B)\\oplus (A\\land C)\\oplus (B\\land C) $$\n", "$$ \\Sigma _0 (A) \\equiv (A\\!\\ggg \\!2)\\oplus (A\\!\\ggg \\!13)\\oplus (A\\!\\ggg \\!22) $$\n", "$$ \\Sigma _1 (E) \\equiv (E\\!\\ggg \\!6)\\oplus (E\\!\\ggg \\!11)\\oplus (E\\!\\ggg \\!25) $$\n", "$$\\boxplus \\mbox{ is addition mod } 2^{32}$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Simple, hash-based PRNG\n", "\n", "Generate a random string $S$ of reasonable length, e.g., 20 digits.\n", "\n", "$$ X_i = {\\mbox{Hash}}(S+i),$$\n", "\n", "where $+$ denotes string concatenation, and the resulting string is interpreted as a (long) hexadecimal number.\n", "\n", "**\"Counter mode.\" Hash-based generators of this type have unbounded state spaces.**\n", "\n", "Implementation in Python by Ron Rivest: http://people.csail.mit.edu/rivest/sampler.py\n", "\n", "Implementation in angular-js by Chris Jerdonek: https://github.com/cjerdonek/quick-sampler\n", "\n", "Implementation in JavaScript by Philip Stark: https://www.stat.berkeley.edu/~stark/Java/Html/sha256Rand.htm\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Some trainwrecks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Microsoft Visual Studio system.random, 2011\n", "\n", "![Microsoft](./RandPics/visualStudiosysRand.png)\n", "\n", "https://connect.microsoft.com/VisualStudio/feedback/details/634761/system-random-serious-bug" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![XP2](./RandPics/xp2k.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aristocrat Leisure of Australia Slot Machines\n", "\n", "\"Wired" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Russian" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dual Elliptic Curve\n", "\n", "![dual EC](./RandPics/dualEC-1.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "![dual EC](./RandPics/dualEC-2.png)\n", "Bernstein, D.J., T. Lange, and R. Niederhagen, 2016. Dual EC: A Standardized Backdoor, in _The New Codebreakers, Essays Dedicated to David Kahn on the Occasion of his 85th Birthday_, Ryan, P.Y.A., D. Naccache, and J-J Quisquater, eds., Springer, Berlin." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## GnuPG RNG bug (18 years, 1998--2016) \n", "\n", "\"An attacker who obtains 4640 bits from the RNG can trivially predict the next 160 bits of output\" \n", "\n", "https://threatpost.com/gpg-patches-18-year-old-libgcrypt-rng-bug/119984/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## RSA\n", "\n", "https://www.schneier.com/blog/archives/2012/02/lousy_random_nu.html\n", "\n", "![Lenstra](./RandPics/lenstraEtal.png)\n", "\n", "\"An analysis comparing millions of RSA public keys gathered from the Internet was announced in 2012 by Lenstra, Hughes, Augier, Bos, Kleinjung, and Wachter. They were able to factor 0.2% of the keys using only Euclid's algorithm. They exploited a weakness unique to cryptosystems based on integer factorization. If n = pq is one public key and n′ = p′q′ is another, then if by chance p = p′, then a simple computation of gcd(n,n′) = p factors both n and n′, totally compromising both keys. Nadia Heninger, part of a group that did a similar experiment, said that the bad keys occurred almost entirely in embedded applications, and explains that the one-shared-prime problem uncovered by the two groups results from situations where the pseudorandom number generator is poorly seeded initially and then reseeded between the generation of the first and second primes.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## PHP\n", "\n", "![php](./RandPics/argyrosKiayias.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Bitcoin on Android: Java nonce collision \n", "\n", "\"In August 2013, it was revealed that bugs in the Java class SecureRandom could generate collisions in the k nonce values used for ECDSA in implementations of Bitcoin on Android. When this occurred the private key could be recovered, in turn allowing stealing Bitcoins from the containing wallet.\"\n", "\n", "https://en.wikipedia.org/wiki/Random_number_generator_attack#Java_nonce_collision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Debian OpenSSL\n", "\n", "![openSSL](./RandPics/openSSL.png)\n", "\n", "Valgrind and Purify warned about uninitialized data. \n", "Only remaining entropy in seed was the process ID. \n", "\n", "Default maximum process ID is 32,768 in Linux.\n", "\n", "Took 2 years (2006--2008) to notice the bug." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "XKCD \n", "![xkcd](https://imgs.xkcd.com/comics/security_holes.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "### Generating a random integer uniformly distributed on $\\{1, \\ldots, m\\}$\n", "\n", "\n", "#### Naive method\n", "A standard way to generate a random integer is to start with $X \\sim U[0,1)$ and define $Y \\equiv 1 + \\lfloor mX \\rfloor$. \n", "\n", "In theory, that's fine. But in practice, $X$ is not really $U[0,1)$ but instead is derived by normalizing a PRN\n", "that's uniform on $w$-bit integers. Then, unless $m$ is a power of 2, the distribution of $Y$ isn't uniform on $\\{1, \\ldots, m\\}$. For $m < 2^w$, the ratio of the largest to smallest selection probability \n", "is, to first order, $1+ m 2^{-w}$. (See, e.g., Knuth v2 3.4.1.A.)\n", "\n", "For $m = 10^9$ and $w=32$, $1 + m 2^{-w} \\approx 1.233$. That could easily matter.\n", "\n", "For $m > 2^{w}$, at least $m-2^w$ values will have probability 0 instead of probability $1/m$. \n", "\n", "If $w=32$, then for $m>2^{32}=4.24e9$, some values will have probability 0. Until relatively recently, R did not support 64-bit integers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### More accurate method\n", "A better way to generate a (pseudo-)random integer on $\\{1, \\ldots m\\}$ from a (pseudo-random) $w$-bit integer in practice is as follows:\n", "\n", "1. Set $\\mu = \\log_2(m-1)$.\n", "1. Define a $w$-bit _mask_ consisting of $\\mu$ bits set to 1 and $(w-\\mu)$ bits set to zero.\n", "1. Generate a random $w$-bit integer $Y$. \n", "1. Define $y$ to be the bitwise `and` of $Y$ and the mask.\n", "1. If $y \\le m-1$, output $x = y+1$; otherwise, return to step 3.\n", "\n", "This is how random integers are generated in numpy by `numpy.random.randint()`.\n", "However, `numpy.random.choice()` does something else that's biased: it finds the closest integer to $mX$.\n", "\n", "In `R`, one would generally use the function `sample(1:m, k, replace=FALSE)` to draw pseudo-random\n", "integers. \n", "It seems that `sample()` uses the faulty `1 + floor(m*X)` approach. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## R random integer generator / sample()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```\n", "if (dn > INT_MAX || k > INT_MAX) {\n", "\t PROTECT(y = allocVector(REALSXP, k));\n", "\t if (replace) {\n", "\t\tdouble *ry = REAL(y);\n", "\t\tfor (R_xlen_t i = 0; i < k; i++) ry[i] = floor(dn * ru() + 1);\n", " \t } else {\n", "#ifdef LONG_VECTOR_SUPPORT\n", "\t\tR_xlen_t n = (R_xlen_t) dn;\n", "\t\tdouble *x = (double *)R_alloc(n, sizeof(double));\n", "\t\tdouble *ry = REAL(y);\n", "\t\tfor (R_xlen_t i = 0; i < n; i++) x[i] = (double) i;\n", "\t\tfor (R_xlen_t i = 0; i < k; i++) {\n", "\t\t R_xlen_t j = (R_xlen_t)floor(n * ru());\n", "\t\t ry[i] = x[j] + 1;\n", "\t\t x[j] = x[--n];\n", "\t\t}\n", "#else\n", "\t\terror(_(\"n >= 2^31, replace = FALSE is only supported on 64-bit platforms\"));\n", "#endif\n", "\t }\n", "\t} else {\n", "\t int n = (int) dn;\n", "\t PROTECT(y = allocVector(INTSXP, k));\n", "\t int *iy = INTEGER(y);\n", "\t /* avoid allocation for a single sample */\n", "\t if (replace || k < 2) {\n", "\t\tfor (int i = 0; i < k; i++) iy[i] = (int)(dn * unif_rand() + 1);\n", "\t } else {\n", "\t\tint *x = (int *)R_alloc(n, sizeof(int));\n", "\t\tfor (int i = 0; i < n; i++) x[i] = i;\n", "\t\tfor (int i = 0; i < k; i++) {\n", " \t\t int j = (int)(n * unif_rand());\n", " \t\t iy[i] = x[j] + 1;\n", " \t\t x[j] = x[--n];\n", " \t\t}\n", " \t}\n", "```\n", "```\n", "/* Our PRNGs have at most 32 bit of precision, and all have at least 25 */\n", "\n", "static R_INLINE double ru()\n", "{\n", " double U = 33554432.0;\n", " return (floor(U*unif_rand()) + unif_rand())/U;\n", "}\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----\n", "## How can you tell whether a sequence is random?\n", "\n", "http://dilbert.com/strip/2001-10-25" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tests of PRNGS\n", "\n", "+ Theoretical analyses, e.g., Knuth (1969), Marsaglia (1968)\n", "\n", "+ Statistical tests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Knuth (1969) _The Art of Computer Programming_, v.2 \n", "\n", "+ 11 types of behavior: equidistribution, series, gaps, poker, coupon collector, permutation frequency, runs, max of $t$, collisions, birthday spacings, serial correlation\n", "+ tests on subsequences, spectral test\n", "+ Many $\\chi^2$-based tests\n", "+ Kolmogorov-Smirnov test for uniformity\n", "+ Sphere-packing\n", "+ **MORE**\n", "\n", "#### Marsaglia (1996) DIEHARD tests\n", "\n", "+ Birthday spacings\n", "+ Overlapping permutations of 5 random numbers\n", "+ Ranks of binary matrices of various dimensions\n", "+ Monkeys at typewriters: count overlapping \"words\" in strings of bits\n", "+ Count the 1s in bytes; translate to \"words.\"\n", "+ Parking lot test, 100 × 100 square lot. Count non-collisions.\n", "+ Minimum distance test: Min distance between 8,000 random points in a 10,000 × 10,000 square. \n", "+ Sphere-packing in a cube at random; diameter of smallest sphere.\n", "+ Squeeze test: Multiply 231 by random floats on (0,1) until hitting 1.\n", "+ Overlapping sums of 100 random (0,1) floats.\n", "+ Runs test for random floats\n", "+ #wins and #rolls in 200,000 games of craps\n", "\n", "#### L’Ecuyer and Simard (2007) TestU01 http://dl.acm.org/citation.cfm?doid=1268776.1268777\n", "\n", "+ Kolmogorov-Smirnov, Crámer-von Mises, Anderson-Darling, clustering, runs, gaps, hits in partition of a hypercube (collisions, empty cells, time between visits, ...), birthday spacings, close pairs, coupon collector, sum collector, complexity of bit strings (__linear complexity__, jump complexity, jump size complexity, Lempel-Ziv complexity), spectral tests on bit strings, autocorrelation of bits, runs and gaps in bits, ..., ranks of binary matrices, longest runs, Hamming weights, random walks, close pairs of binary sequences, \n", "\n", "### NIST Tests\n", "\n", "+ http://csrc.nist.gov/groups/ST/toolkit/rng/stats_tests.html\n", "+ http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22rev1a.pdf\n", "\n", "![NIST](./RandPics/NISTTests.png)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }