{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Installing packages:\n", "\t.package(path: \"/home/jupyter/notebooks/swift/FastaiNotebook_02a_why_sqrt5\")\n", "\t\tFastaiNotebook_02a_why_sqrt5\n", "With SwiftPM flags: []\n", "Working in: /tmp/tmp1on2fw8s/swift-install\n", "[1/6] Compiling FastaiNotebook_02a_why_sqrt5 01_matmul.swift\n", "[2/6] Compiling FastaiNotebook_02a_why_sqrt5 02_fully_connected.swift\n", "[3/6] Compiling FastaiNotebook_02a_why_sqrt5 02a_why_sqrt5.swift\n", "[4/6] Compiling FastaiNotebook_02a_why_sqrt5 00_load_data.swift\n", "[5/6] Compiling FastaiNotebook_02a_why_sqrt5 01a_fastai_layers.swift\n", "[6/7] Merging module FastaiNotebook_02a_why_sqrt5\n", "[7/8] Compiling jupyterInstalledPackages jupyterInstalledPackages.swift\n", "[8/9] Merging module jupyterInstalledPackages\n", "[9/9] Linking libjupyterInstalledPackages.so\n", "Initializing Swift...\n", "Installation complete!\n" ] } ], "source": [ "%install-location $cwd/swift-install\n", "%install '.package(path: \"$cwd/FastaiNotebook_02a_why_sqrt5\")' FastaiNotebook_02a_why_sqrt5" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "//export\n", "import Path\n", "import TensorFlow" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import FastaiNotebook_02a_why_sqrt5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why you need a good init" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand why initialization is important in a neural net, we'll focus on the basic operation you have there: matrix multiplications. So let's just take a vector `x`, and a matrix `a` initialized randomly, then multiply them 100 times (as if we had 100 layers). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in 0..<100 { x = a • x }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", " - .0 : nan(0x1fffff)\n", " - .1 : nan(0x1fffff)\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(x.mean(),x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem you'll get with that is activation explosion: very soon, your activations will go to nan. We can even ask the loop to break when that first happens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512,512])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "27\r\n" ] } ], "source": [ "for i in 0..<100 {\n", " x = a • x\n", " if x.std().scalarized().isNaN {\n", " print(i)\n", " break\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It only takes around 30 multiplications! 
"On the other hand, if you initialize your weights with a scale that is too low, you'll get the opposite problem:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512, 512]) * 0.01" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for _ in 0..<100 { x = a • x }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", "  - .0 : 0.0\n", "  - .1 : 0.0\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(x.mean(), x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, every activation has vanished to 0. To avoid that problem, people have come up with several strategies to initialize their weight matrices, such as:\n", "- use a standard deviation that makes sure `x` and `Ax` have exactly the same scale\n", "- use an orthogonal matrix to initialize the weights (orthogonal matrices have the special property that they preserve the L2 norm, so `x` and `Ax` would have the same sum of squares in that case)\n", "- use [spectral normalization](https://arxiv.org/pdf/1802.05957.pdf) on the matrix `A` (the spectral norm of `A` is the smallest number `M` such that `matmul(A, x).norm() <= M * x.norm()` for every `x`, so dividing `A` by `M` ensures the activations can't explode; they can still vanish, though). A rough sketch of this idea appears at the end of this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The magic number for scaling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will focus on the first strategy, known as Xavier initialization. It tells us to use a scale equal to `1/sqrt(n_in)`, where `n_in` is the number of inputs of our matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512, 512]) / sqrt(512)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for _ in 0..<100 { x = a • x }" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "▿ 2 elements\n", "  - mean : 0.061417937\n", "  - std : 1.4370023\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(mean: x.mean(), std: x.std())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And indeed it works. Note that this magic number isn't very far from the 0.01 we had earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.044194173824159216\n" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1 / sqrt(512)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But where does it come from? It's not that mysterious if you remember the definition of matrix multiplication. When we do `y = matmul(a, x)`, the coefficients of `y` are defined by\n", "\n", "$$y_{i} = a_{i,0} x_{0} + a_{i,1} x_{1} + \\cdots + a_{i,n-1} x_{n-1} = \\sum_{k=0}^{n-1} a_{i,k} x_{k}$$\n", "\n", "or in code:\n", "```\n", "for i in 0..<a.shape[0] {\n", "    y[i] = 0\n", "    for k in 0..<a.shape[1] {\n", "        y[i] += a[i, k] * x[k]\n", "    }\n", "}\n", "```" ] },
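{ "cell_type": "markdown", "metadata": {}, "source": [ "Each `y_i` is therefore a sum of `n_in = 512` products of independent random numbers, each with mean 0 and standard deviation 1. Every product has variance 1, so the sum has variance 512 and standard deviation `sqrt(512) ≈ 22.6`: that is exactly the growth factor the `1/sqrt(n_in)` scale is there to cancel. As a quick sanity check (reusing the same `TF` and `•` helpers as above), we can measure it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Each coefficient of a • x sums 512 unit-variance products, so its\n", "// standard deviation should come out close to sqrt(512) ≈ 22.6.\n", "var x = TF(randomNormal: [512, 1])\n", "let a = TF(randomNormal: [512, 512])\n", "print((a • x).std(), sqrt(512.0))" ] },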
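{ "cell_type": "markdown", "metadata": {}, "source": [ "As an aside, here is the rough sketch of the spectral normalization idea promised earlier: we estimate the spectral norm of `a` with a few steps of power iteration, then divide `a` by that estimate. This is only an illustration of the principle, not the full method from the paper:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "// Illustrative sketch only: power iteration on A^T A converges to the top\n", "// singular direction, and the norm of a • v then approximates the spectral\n", "// norm (largest singular value) of a.\n", "let a = TF(randomNormal: [512, 512])\n", "var v = TF(randomNormal: [512, 1])\n", "for _ in 0..<100 {\n", "    v = a.transposed() • (a • v)\n", "    v = v / sqrt((v * v).sum())\n", "}\n", "let sigma = sqrt(((a • v) * (a • v)).sum())\n", "// After dividing by sigma, repeated multiplication can no longer explode\n", "// (though the activations may still vanish).\n", "let aNormed = a / sigma\n", "var x = TF(randomNormal: [512, 1])\n", "for _ in 0..<100 { x = aNormed • x }\n", "print(x.std())" ] } ], "metadata": { "kernelspec": { "display_name": "Swift", "language": "swift", "name": "swift" } }, "nbformat": 4, "nbformat_minor": 2 }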