{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Computer arithmetics
1.1  Units of computer storage
1.2  Storage of Characters
1.3  Integers: fixed-point number system
1.3.1  Signed integers
1.3.2  Unsigned integers
1.3.3  BigInt
1.3.4  Overflow and underflow for integer arithmetic
1.4  Real numbers: floating-number system
1.4.1  Double precision (Float64)
1.4.2  Single precision (Float32)
1.4.3  Half precision (Float16)
1.4.4  Special floating-point numbers.
1.4.5  Rounding
1.4.6  Summary
1.4.7  Overflow and underflow of floating-point number
1.5  Catastrophic cancellation
1.5.1  Algebraic laws
1.6  Further reading
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.1.0\n", "Commit 80516ca202 (2019-01-21 21:24 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-6.0.1 (ORCJIT, skylake)\n", "Environment:\n", " JULIA_EDITOR = code\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Computer arithmetics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Units of computer storage\n", "\n", "* `bit` = `binary` + `digit` (coined by statistician [John Tukey](https://en.wikipedia.org/wiki/Bit#History)). \n", "* `byte` = 8 bits. \n", "* KB = kilobyte = $10^3$ bytes. \n", "* MB = megabytes = $10^6$ bytes. \n", "* GB = gigabytes = $10^9$ bytes. Typical RAM size. \n", "* TB = terabytes = $10^{12}$ bytes. Typical hard drive size. Size of NYSE each trading session. \n", "* PB = petabytes = $10^{15}$ bytes. \n", "* EB = exabytes = $10^{18}$ bytes. Size of all healthcare data in 2011 is ~150 EB. \n", "* ZB = zetabytes = $10^{21}$ bytes. \n", "\n", "Julia function `Base.summarysize` shows the amount of memory (in bytes) used by an object." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "80040" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = rand(100, 100)\n", "Base.summarysize(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`varinfo()` function prints all variables in workspace and their sizes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/latex": [ "\\begin{tabular}\n", "{l | r | l}\n", "name & size & summary \\\\\n", "\\hline\n", "Base & & Module \\\\\n", "Core & & Module \\\\\n", "Main & & Module \\\\\n", "x & 78.164 KiB & 100×100 Array\\{Float64,2\\} \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "| name | size | summary |\n", "|:---- | ----------:|:------------------------ |\n", "| Base | | Module |\n", "| Core | | Module |\n", "| Main | | Module |\n", "| x | 78.164 KiB | 100×100 Array{Float64,2} |\n" ], "text/plain": [ "name size summary \n", "–––– –––––––––– ––––––––––––––––––––––––\n", "Base Module \n", "Core Module \n", "Main Module \n", "x 78.164 KiB 100×100 Array{Float64,2}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "varinfo() # similar to Matlab whos()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Storage of Characters\n", "\n", "* Plain text files are stored in the form of characters: `.jl`, `.r`, `.c`, `.cpp`, `.ipynb`, `.html`, `.tex`, ... \n", "* ASCII (American Code for Information Interchange): 7 bits, only $2^7=128$ characters. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "128×2 Array{Any,2}:\n", " 0 '\\0' \n", " 1 '\\x01'\n", " 2 '\\x02'\n", " 3 '\\x03'\n", " 4 '\\x04'\n", " 5 '\\x05'\n", " 6 '\\x06'\n", " 7 '\\a' \n", " 8 '\\b' \n", " 9 '\\t' \n", " 10 '\\n' \n", " 11 '\\v' \n", " 12 '\\f' \n", " ⋮ \n", " 116 't' \n", " 117 'u' \n", " 118 'v' \n", " 119 'w' \n", " 120 'x' \n", " 121 'y' \n", " 122 'z' \n", " 123 '{' \n", " 124 '|' \n", " 125 '}' \n", " 126 '~' \n", " 127 '\\x7f'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# integers 0, 1, ..., 127 and corresponding ascii character\n", "[0:127 Char.(0:127)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Extended ASCII: 8 bits, $2^8=256$ characters. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "128×2 Array{Any,2}:\n", " 128 '\\u80'\n", " 129 '\\u81'\n", " 130 '\\u82'\n", " 131 '\\u83'\n", " 132 '\\u84'\n", " 133 '\\u85'\n", " 134 '\\u86'\n", " 135 '\\u87'\n", " 136 '\\u88'\n", " 137 '\\u89'\n", " 138 '\\u8a'\n", " 139 '\\u8b'\n", " 140 '\\u8c'\n", " ⋮ \n", " 244 'ô' \n", " 245 'õ' \n", " 246 'ö' \n", " 247 '÷' \n", " 248 'ø' \n", " 249 'ù' \n", " 250 'ú' \n", " 251 'û' \n", " 252 'ü' \n", " 253 'ý' \n", " 254 'þ' \n", " 255 'ÿ' " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# integers 128, 129, ..., 255 and corresponding extended ascii character\n", "# show(STDOUT, \"text/plain\", [128:255 Char.(128:255)])\n", "[128:255 Char.(128:255)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Unicode: UTF-8, UTF-16 and UTF-32 support many more characters including foreign characters; last 7 digits conform to ASCII. \n", "\n", "* [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is the current dominant character encoding on internet. \n", "\n", "\n", "\n", "* Julia supports the full range of UTF-8 characters. You can type many Unicode math symbols by typing the backslashed LaTeX symbol name followed by tab. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# \\beta-\n", "β = 0.0\n", "# \\beta--\\hat-\n", "β̂ = 0.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For a table of unicode symbols that can be entered via tab completion of LaTeX-like abbreviations: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Integers: fixed-point number system\n", "\n", "* Fixed-point number system is a computer model for integers $\\mathbb{Z}$. \n", "\n", "* The number of bits and method of representing negative numbers vary from system to system. \n", " - The `integer` type in R has $M=32$ or 64 bits, determined by machine word size. \n", " - Matlab has `(u)int8`, `(u)int16`, `(u)int32`, `(u)int64`. \n", "\n", "* Julia has even more integer types. Using Tom Breloff's `Plots.jl` and `GraphRecipes.jl` packages, we can [visualize the type tree](http://www.breloff.com/Graphs/) under `Integer`\n", " - Storage for a `Signed` or `Unsigned` integer can be $M = 8, 16, 32, 64$ or 128 bits. \n", " - GraphRecipes.jl package has a convenience function for plotting the type hiearchy." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Any\n", "\n", "\n", "Number\n", "\n", "\n", "Real\n", "\n", "\n", "Integer\n", "\n", "\n", "Bool\n", "\n", "\n", "OffsetInteger\n", "\n", "\n", "Signed\n", "\n", "\n", "BigInt\n", "\n", "\n", "Int128\n", "\n", "\n", "Int16\n", "\n", "\n", "Int32\n", "\n", "\n", "Int64\n", "\n", "\n", "Int8\n", "\n", "\n", "Unsigned\n", "\n", "\n", "UInt128\n", "\n", "\n", "UInt16\n", "\n", "\n", "UInt32\n", "\n", "\n", "UInt64\n", "\n", "\n", "UInt8\n", "\n", "\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using GraphRecipes, Plots\n", "\n", "#pyplot(size=(800, 600))\n", "gr(size=(600, 400))\n", "theme(:default)\n", "\n", "plot(Integer, method=:tree, fontsize=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Signed integers\n", "\n", "* First bit indicates sign. \n", " - `0` for nonnegative numbers\n", " - `1` for negative numbers \n", " \n", "* **Two's complement representation** for negative numbers. \n", " - Sign bit is set to 1 \n", " - remaining bits are set to opposite values \n", " - 1 is added to the result \n", " - Two's complement representation of a negative integer `x` is same as the unsigned integer `2^64 + x`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "typeof(18) = Int64\n", "bitstring(18) = \"0000000000000000000000000000000000000000000000000000000000010010\"\n", "bitstring(-18) = \"1111111111111111111111111111111111111111111111111111111111101110\"\n", "bitstring(UInt64(Int128(2) ^ 64 - 18)) == bitstring(-18) = true\n", "bitstring(2 * 18) = \"0000000000000000000000000000000000000000000000000000000000100100\"\n", "bitstring(2 * -18) = \"1111111111111111111111111111111111111111111111111111111111011100\"\n" ] } ], "source": [ "@show typeof(18)\n", "@show bitstring(18)\n", "@show bitstring(-18)\n", "@show bitstring(UInt64(Int128(2)^64 - 18)) == bitstring(-18)\n", "@show bitstring(2 * 18) # shift bits of 18\n", "@show bitstring(2 * -18); # shift bits of -18" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Two's complement representation respects modular arithmetic nicely. \n", " Addition of any two signed integers are just bitwise addition, possibly modulo $2^M$\n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **Range** of representable integers by $M$-bit **signed integer** is $[-2^{M-1},2^{M-1}-1]$.\n", " - Julia functions `typemin(T)` and `typemax(T)` give the lowest and highest representable number of a type `T` respectively" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(-9223372036854775808, 9223372036854775807)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "typemin(Int64), typemax(Int64)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Int8\t-128\t127\n", "Int16\t-32768\t32767\n", "Int32\t-2147483648\t2147483647\n", "Int64\t-9223372036854775808\t9223372036854775807\n", "Int128\t-170141183460469231731687303715884105728\t170141183460469231731687303715884105727\n" ] } ], "source": [ "for T in [Int8, Int16, Int32, Int64, Int128]\n", " println(T, '\\t', typemin(T), '\\t', typemax(T))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Unsigned integers\n", "\n", "* For unsigned integers, the range is $[0,2^M-1]$." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "UInt8\t0\t255\n", "UInt16\t0\t65535\n", "UInt32\t0\t4294967295\n", "UInt64\t0\t18446744073709551615\n", "UInt128\t0\t340282366920938463463374607431768211455\n" ] } ], "source": [ "for t in [UInt8, UInt16, UInt32, UInt64, UInt128]\n", " println(t, '\\t', typemin(t), '\\t', typemax(t))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `BigInt`\n", "\n", "Julia `BigInt` type is arbitrary precision." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "typemax(Int128) = 170141183460469231731687303715884105727\n", "typemax(Int128) + 1 = -170141183460469231731687303715884105728\n", "BigInt(typemax(Int128)) + 1 = 170141183460469231731687303715884105728\n" ] } ], "source": [ "@show typemax(Int128)\n", "@show typemax(Int128) + 1 # modular arithmetic!\n", "@show BigInt(typemax(Int128)) + 1;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overflow and underflow for integer arithmetic\n", "\n", "R reports `NA` for integer overflow and underflow. \n", "**Julia outputs the result according to modular arithmetic.**" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "typemax(Int32) = 2147483647\n", "typemax(Int32) + Int32(1) = -2147483648\n" ] } ], "source": [ "@show typemax(Int32)\n", "@show typemax(Int32) + Int32(1); # modular arithmetics!" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{IntSxp}\n", "[1] 2147483647\n" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using RCall\n", "\n", "R\"\"\"\n", ".Machine$integer.max\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{IntSxp}\n", "[1] 2147483647\n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R\"\"\"\n", "M <- 32\n", "big <- 2^(M-1) - 1\n", "as.integer(big)\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "┌ Warning: RCall.jl: Warning: NAs introduced by coercion to integer range\n", "└ @ RCall /Users/huazhou/.julia/packages/RCall/ffM0W/src/io.jl:113\n" ] }, { "data": { "text/plain": [ "RObject{IntSxp}\n", "[1] NA\n" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R\"\"\"\n", "as.integer(big+1)\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Real numbers: floating-number system\n", "\n", "Floating-point number system is a computer model for real numbers.\n", "\n", "* Most computer systems adopt the [IEEE 754 standard](https://en.wikipedia.org/wiki/IEEE_floating_point), established in 1985, for floating-point arithmetics. \n", "For the history, see an [interview with William Kahan](http://www.cs.berkeley.edu/~wkahan/ieee754status/754story.html).\n", "\n", "* In the scientific notation, a real number is represented as\n", "$$\\pm d_0.d_1d_2 \\cdots d_p \\times b^e.$$\n", "In computer, the _base_ is $b=2$ and the digits $d_i$ are 0 or 1.\n", "\n", "* **Normalized** vs **denormalized** numbers. For example, decimal number 18 is\n", "$$ +1.0010 \\times 2^4 \\quad (\\text{normalized})$$\n", "or, equivalently,\n", "$$ +0.1001 \\times 2^5 \\quad (\\text{denormalized}).$$\n", "\n", "* In the floating-number system, computer stores \n", " - sign bit \n", " - the _fraction_ (or _mantissa_, or _significand_) of the **normalized** representation\n", " - the actual exponent _plus_ a bias" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "Any\n", "\n", "\n", "Number\n", "\n", "\n", "Real\n", "\n", "\n", "AbstractFloat\n", "\n", "\n", "BigFloat\n", "\n", "\n", "Float16\n", "\n", "\n", "Float32\n", "\n", "\n", "Float64\n", "\n", "\n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using GraphRecipes, Plots\n", "\n", "#pyplot(size=(800, 600))\n", "gr(size=(600, 400))\n", "theme(:default)\n", "\n", "plot(AbstractFloat, method=:tree, fontsize=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Double precision (Float64)\n", "\n", "\n", "\n", "- Double precision (64 bits = 8 bytes) numbers are the dominant data type in scientific computing.\n", " \n", "- In Julia, `Float64` is the type for double precision numbers. \n", "\n", "- First bit is sign bit. \n", "\n", "- $p=52$ significant bits.\n", "\n", "- 11 exponent bits: $e_{\\max}=1023$, $e_{\\min}=-1022$, **bias**=1023. \n", "\n", "- $e_{\\text{min}}-1$ and $e_{\\text{max}}+1$ are reserved for special numbers. \n", "\n", "- range of **magnitude**: $10^{\\pm 308}$ in decimal because $\\log_{10} (2^{1023}) \\approx 308$. \n", "\n", "- **precision** to the $\\log_{10}(2^{-52}) \\approx 15$ decimal point." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Double precision:\n", "bitstring(Float64(18)) = \"0100000000110010000000000000000000000000000000000000000000000000\"\n", "bitstring(Float64(-18)) = \"1100000000110010000000000000000000000000000000000000000000000000\"\n" ] } ], "source": [ "println(\"Double precision:\")\n", "@show bitstring(Float64(18)) # 18 in double precision\n", "@show bitstring(Float64(-18)); # -18 in double precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Single precision (Float32)\n", "\n", "\n", "\n", "- In Julia, `Float32` is the type for single precision numbers. \n", "\n", "- First bit is sign bit. \n", "\n", "- $p=23$ significant bits. \n", "\n", "- 8 exponent bits: $e_{\\max}=127$, $e_{\\min}=-126$, **bias**=127. \n", "\n", "- $e_{\\text{min}}-1$ and $e_{\\text{max}}+1$ are reserved for special numbers. \n", "\n", "- range of **magnitude**: $10^{\\pm 38}$ in decimal because $\\log_{10} (2^{127}) \\approx 38$.\n", "\n", "- **precision**: $\\log_{10}(2^{23}) \\approx 7$ decimal point. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Single precision:\n", "bitstring(Float32(18.0)) = \"01000001100100000000000000000000\"\n", "bitstring(Float32(-18.0)) = \"11000001100100000000000000000000\"\n" ] } ], "source": [ "println(\"Single precision:\")\n", "@show bitstring(Float32(18.0)) # 18 in single precision\n", "@show bitstring(Float32(-18.0)); # -18 in single precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Half precision (Float16)\n", "\n", "\n", " \n", "- In Julia, `Float16` is the type for half precision numbers.\n", "\n", "- First bit is sign bit. \n", "\n", "- $p=10$ significant bits. \n", "\n", "- 5 exponent bits: $e_{\\max}=15$, $e_{\\min}=-14$, bias=15. \n", "\n", "- $e_{\\text{min}}-1$ and $e_{\\text{max}}+1$ are reserved for special numbers. \n", "\n", "- range of **magnitude**: $10^{\\pm 4}$ in decimal because $\\log_{10} (2^{15}) \\approx 4$. \n", "\n", "- **precision**: $\\log_{10}(2^{10}) \\approx 3$ decimal point. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Half precision:\n", "bitstring(Float16(18)) = \"0100110010000000\"\n", "bitstring(Float16(-18)) = \"1100110010000000\"\n" ] } ], "source": [ "println(\"Half precision:\")\n", "@show bitstring(Float16(18)) # 18 in half precision\n", "@show bitstring(Float16(-18)); # -18 in half precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Special floating-point numbers. \n", "\n", "- Exponent $e_{\\max}+1$ plus a zero mantissa means $\\pm \\infty$." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitstring(Inf) = \"0111111111110000000000000000000000000000000000000000000000000000\"\n", "bitstring(-Inf) = \"1111111111110000000000000000000000000000000000000000000000000000\"\n" ] } ], "source": [ "@show bitstring(Inf) # Inf in double precision\n", "@show bitstring(-Inf); # -Inf in double precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Exponent $e_{\\max}+1$ plus a nonzero mantissa means `NaN`. `NaN` could be produced from `0 / 0`, `0 * Inf`, ... \n", "\n", "- In general `NaN ≠ NaN` bitwise. Test whether a number is `NaN` by `isnan` function. " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitstring(0 / 0) = \"1111111111111000000000000000000000000000000000000000000000000000\"\n", "bitstring(0Inf) = \"1111111111111000000000000000000000000000000000000000000000000000\"\n" ] } ], "source": [ "@show bitstring(0 / 0) # NaN\n", "@show bitstring(0 * Inf); # NaN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Exponent $e_{\\min}-1$ with a zero mantissa represents the real number 0. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitstring(0.0) = \"0000000000000000000000000000000000000000000000000000000000000000\"\n" ] } ], "source": [ "@show bitstring(0.0); # 0 in double precision " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Exponent $e_{\\min}-1$ with a nonzero mantissa are for numbers less than $b^{e_{\\min}}$. \n", " Numbers are _denormalized_ in the range $(0,b^{e_{\\min}})$ -- **graceful underflow**. " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "nextfloat(0.0) = 5.0e-324\n", "bitstring(nextfloat(0.0)) = \"0000000000000000000000000000000000000000000000000000000000000001\"\n" ] } ], "source": [ "@show nextfloat(0.0) # next representable number\n", "@show bitstring(nextfloat(0.0)); # denormalized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rounding\n", "\n", "* Rounding is necessary whenever a number has more than $p$ significand bits. Most computer systems use the default IEEE 754 _round to nearest_ mode (also called _ties to even_ mode). Julia offers several [rounding modes](https://docs.julialang.org/en/v1/base/math/#Base.Rounding.RoundingMode), the default being [`RoundNearest`](https://docs.julialang.org/en/v1/base/math/#Base.Rounding.RoundNearest). For example, the number 0.1 in decimal system cannot be represented accurately as a floating point number:\n", "$$ 0.1 = 1.10011001... \\times 2^{-4} $$" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitstring(0.1f0) = \"00111101110011001100110011001101\"\n", "bitstring(0.1) = \"0011111110111001100110011001100110011001100110011001100110011010\"\n" ] } ], "source": [ "@show bitstring(0.1f0) # single precision Float32, 1001 gets rounded to 101(0)\n", "@show bitstring(0.1); # double precision Float64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary\n", "\n", "- Single precision: range $\\pm 10^{\\pm 38}$ with precision up to 7 decimal digits.\n", "\n", "- Double precision: range $\\pm 10^{\\pm 308}$ with precision up to 16 decimal digits. \n", "\n", "- The floating-point numbers do not occur uniformly over the real number line\n", " \n", " Each magnitude has same number of representible numbers\n", " \n", "- **Machine epsilons** are the spacings of numbers around 1: \n", " $$\\epsilon_{\\min}=b^{-p}, \\quad \\epsilon_{\\max} = b^{1-p}.$$\n", " " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eps(Float32) = 1.1920929f-7\n", "eps(Float64) = 2.220446049250313e-16\n", "eps(100.0) = 1.4210854715202004e-14\n", "eps(0.0) = 5.0e-324\n", "x = 1.25f0 = 1.25f0\n", "(prevfloat(x), x, nextfloat(x)) = (1.2499999f0, 1.25f0, 1.2500001f0)\n", "(bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x))) = (\"00111111100111111111111111111111\", \"00111111101000000000000000000000\", \"00111111101000000000000000000001\")\n" ] } ], "source": [ "@show eps(Float32) # machine epsilon for a floating point type\n", "@show eps(Float64) # same as eps()\n", "# eps(x) is the spacing after x\n", "@show eps(100.0)\n", "@show eps(0.0)\n", "# nextfloat(x) and prevfloat(x) give the neighbors of x\n", "@show x = 1.25f0\n", "@show prevfloat(x), x, nextfloat(x)\n", "@show bitstring(prevfloat(x)), bitstring(x), bitstring(nextfloat(x));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In R, the variable `.Machine` contains numerical characteristics of the machine." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RObject{VecSxp}\n", "$double.eps\n", "[1] 2.220446e-16\n", "\n", "$double.neg.eps\n", "[1] 1.110223e-16\n", "\n", "$double.xmin\n", "[1] 2.225074e-308\n", "\n", "$double.xmax\n", "[1] 1.797693e+308\n", "\n", "$double.base\n", "[1] 2\n", "\n", "$double.digits\n", "[1] 53\n", "\n", "$double.rounding\n", "[1] 5\n", "\n", "$double.guard\n", "[1] 0\n", "\n", "$double.ulp.digits\n", "[1] -52\n", "\n", "$double.neg.ulp.digits\n", "[1] -53\n", "\n", "$double.exponent\n", "[1] 11\n", "\n", "$double.min.exp\n", "[1] -1022\n", "\n", "$double.max.exp\n", "[1] 1024\n", "\n", "$integer.max\n", "[1] 2147483647\n", "\n", "$sizeof.long\n", "[1] 8\n", "\n", "$sizeof.longlong\n", "[1] 8\n", "\n", "$sizeof.longdouble\n", "[1] 16\n", "\n", "$sizeof.pointer\n", "[1] 8\n", "\n" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "R\"\"\"\n", ".Machine\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Julia provides `Float16` (half precision), `Float32` (single precision), `Float64` (double precision), and `BigFloat` (arbitrary precision)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overflow and underflow of floating-point number\n", "\n", "* For double precision, the range is $\\pm 10^{\\pm 308}$. In most situations, underflow (magnitude of result is less than $10^{-308}$) is preferred over overflow (magnitude of result is larger than $10^{-308}$). Overflow produces $\\pm \\inf$. Underflow yields zeros or denormalized numbers. \n", "\n", "* E.g., the logit link function is\n", "$$p = \\frac{\\exp (x^T \\beta)}{1 + \\exp (x^T \\beta)} = \\frac{1}{1+\\exp(- x^T \\beta)}.$$\n", "The former expression can easily lead to `Inf / Inf = NaN`, while the latter expression leads to graceful underflow.\n", "\n", "* `floatmin` and `floatmax` functions gives largest and smallest _finite_ number represented by a type." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Float16\t6.104e-5\t6.55e4\t-Inf\tInf\t0.000977\n", "Float32\t1.1754944e-38\t3.4028235e38\t-Inf\tInf\t1.1920929e-7\n", "Float64\t2.2250738585072014e-308\t1.7976931348623157e308\t-Inf\tInf\t2.220446049250313e-16\n" ] } ], "source": [ "for T in [Float16, Float32, Float64]\n", " println(T, '\\t', floatmin(T), '\\t', floatmax(T), '\\t', typemin(T), \n", " '\\t', typemax(T), '\\t', eps(T))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* `BigFloat` in Julia offers arbitrary precision." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precision(BigFloat) = 256\n", "floatmin(BigFloat) = 8.50969131174083613912978790962048280567755996982969624908264897850135431080301e-1388255822130839284\n", "floatmax(BigFloat) = 5.875653789111587590936911998878442589938516392745498308333779606469323584389875e+1388255822130839282\n" ] } ], "source": [ "@show precision(BigFloat)\n", "@show floatmin(BigFloat)\n", "@show floatmax(BigFloat);" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286198\n", "BigFloat(π) = 3.141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724586997\n" ] } ], "source": [ "@show BigFloat(π); # default precision for BigFloat is 256 bits\n", "# set precision to 1024 bits\n", "setprecision(BigFloat, 1024) do \n", " @show BigFloat(π)\n", "end;" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Catastrophic cancellation\n", "\n", "* **Scenario 1**: Addition or subtraction of two numbers of widely different magnitudes: $a+b$ or $a-b$ where $a \\gg b$ or $a \\ll b$. We loose the precision in the number of smaller magnitude. Consider \n", "$$\\begin{eqnarray*}\n", " a &=& x.xxx ... \\times 2^{30} \\\\ \n", " b &=& y.yyy... \\times 2^{-30}\n", "\\end{eqnarray*}$$\n", "What happens when computer calculates $a+b$? We get $a+b=a$!" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a = 2.0 ^ 30 = 1.073741824e9\n", "b = 2.0 ^ -30 = 9.313225746154785e-10\n", "a + b == a = true\n" ] }, { "data": { "text/plain": [ "true" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@show a = 2.0^30\n", "@show b = 2.0^-30\n", "@show a + b == a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **Scenario 2**: Subtraction of two nearly equal numbers eliminates significant digits. $a-b$ where $a \\approx b$. Consider \n", "$$\\begin{eqnarray*}\n", " a &=& x.xxxxxxxxxx1ssss \\\\\n", " b &=& x.xxxxxxxxxx0tttt\n", "\\end{eqnarray*}$$\n", "The result is $1.vvvvu...u$ where $u$ are unassigned digits." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "bitstring(a) = \"00111111100111100000011001010001\"\n", "bitstring(b) = \"00111111100111100000011001010000\"\n", "a - b = 1.1920929f-7\n" ] }, { "data": { "text/plain": [ "1.1920929f-7" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = 1.2345678f0 # rounding\n", "@show bitstring(a) # rounding\n", "b = 1.2345677f0\n", "@show bitstring(b)\n", "@show a - b # correct result should be 1e-7" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Implications for numerical computation\n", " - Rule 1: add small numbers together before adding larger ones \n", " - Rule 2: add numbers of like magnitude together (paring). When all numbers are of same sign and similar magnitude, add in pairs so each stage the summands are of similar magnitude \n", " - Rule 3: avoid substraction of two numbers that are nearly equal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Algebraic laws\n", "\n", "Floating-point numbers may violate many algebraic laws we are familiar with, such associative and distributive laws. See Homework 1 problems." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further reading\n", "\n", "* Textbook treatment, e.g., Chapter II.2 of [Computational Statistics](http://ucla.worldcat.org/title/computational-statistics/oclc/437345409&referer=brief_results) by James Gentle (2010).\n", "\n", "* [What every computer scientist should know about floating-point arithmetic](http://hua-zhou.github.io/teaching/biostatm280-2017spring/readings/Goldberg91FloatingPoint.pdf) by David Goldberg (1991)." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "250px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }