{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Installing packages:\n", "\t.package(path: \"/home/ubuntu/fastai_docs/dev_swift/FastaiNotebook_02a_why_sqrt5\")\n", "\t\tFastaiNotebook_02a_why_sqrt5\n", "With SwiftPM flags: []\n", "Working in: /tmp/tmp9_7n1nl5\n", "Fetching https://github.com/mxcl/Path.swift\n", "Fetching https://github.com/JustHTTP/Just\n", "Completed resolution in 2.26s\n", "Cloning https://github.com/JustHTTP/Just\n", "Resolving https://github.com/JustHTTP/Just at 0.7.1\n", "Cloning https://github.com/mxcl/Path.swift\n", "Resolving https://github.com/mxcl/Path.swift at 0.16.2\n", "Compile Swift Module 'Just' (1 sources)\n", "Compile Swift Module 'Path' (9 sources)\n", "Compile Swift Module 'FastaiNotebook_02a_why_sqrt5' (6 sources)\n", "Compile Swift Module 'jupyterInstalledPackages' (1 sources)\n", "Linking ./.build/x86_64-unknown-linux/debug/libjupyterInstalledPackages.so\n", "Initializing Swift...\n", "Loading library...\n", "Installation complete!\n" ] } ], "source": [ "%install '.package(path: \"$cwd/FastaiNotebook_02a_why_sqrt5\")' FastaiNotebook_02a_why_sqrt5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import FastaiNotebook_02a_why_sqrt5\n", "import TensorFlow" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why you need a good init" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand why initialization is important in a neural net, we'll focus on the basic operation you have there: matrix multiplications. So let's just take a vector `x`, and a matrix `a` initiliazed randomly, then multiply them 100 times (as if we had 100 layers). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in 0..<100 { x = matmul(a, x) }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", " - .0 : nan(0x1fffff)\n", " - .1 : nan(0x1fffff)\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(x.mean(),x.standardDeviation())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem you'll get with that is activation explosion: very soon, your activations will go to nan. We can even ask the loop to break when that first happens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27\r\n" ] } ], "source": [ "for i in 0..<100 {\n", " x = matmul(a, x)\n", " if x.standardDeviation().scalarized().isNaN {\n", " print(i)\n", " break\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It only takes around 30 multiplications! 
{ "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, if you initialize your weights with a scale that is too low, then you'll get another problem:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512]) * 0.01" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in 0..<100 { x = matmul(a, x) }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", " - .0 : 0.0\n", " - .1 : 0.0\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(x.mean(),x.standardDeviation())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, every activation vanished to 0. So to avoid that problem, people have come up with several strategies to initialize their weight matrices, such as:\n", "- use a standard deviation that will make sure x and Ax have exactly the same scale\n", "- use an orthogonal matrix to initialize the weights (orthogonal matrices have the special property that they preserve the L2 norm, so x and Ax would have the same sum of squares in that case)\n", "- use [spectral normalization](https://arxiv.org/pdf/1802.05957.pdf) on the matrix A (the spectral norm of A is the least possible number M such that `matmul(A,x).norm() <= M*x.norm()`, so dividing A by this M ensures you don't overflow; activations can still vanish with this, though)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The magic number for scaling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will focus on the first strategy, which is Xavier initialization. It tells us that we should use a scale equal to `1/sqrt(n_in)` where `n_in` is the number of inputs of our matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512]) / sqrt(512)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in 0..<100 { x = matmul(a, x) }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", " - .0 : -1.0634497\n", " - .1 : 10.697724\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(x.mean(),x.standardDeviation())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And indeed it works: after 100 layers the activations neither explode nor vanish. Note that this magic number isn't very far from the 0.01 we had earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.044194173824159216\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 / sqrt(512)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But where does it come from? It's not that mysterious if you remember the definition of matrix multiplication. When we do `y = matmul(a, x)`, the coefficients of `y` are defined by\n", "\n", "$$y_{i} = a_{i,0} x_{0} + a_{i,1} x_{1} + \\cdots + a_{i,n-1} x_{n-1} = \\sum_{k=0}^{n-1} a_{i,k} x_{k}$$\n", "\n", "or in code:\n", "```\n", "for i in 0..