{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.1 Layers and Blocks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- MXNet's Blocks\n",
    "  - Layers are blocks\n",
    "  - Many layers can be a block\n",
    "  - Many blocks can be a block\n",
    "  - Code can be a block\n",
    "  - Blocks take are of a lot of housekeeping, such as parameter initialization, backprop and related issues.\n",
    "  - Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.\n",
    "\n",
    "- Blocks are combinations of one or more layers. \n",
    "- Network design is aided by code that generates such blocks on demand."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.09543003  0.04614332 -0.00286653 -0.07790346 -0.05130243  0.02942039\n",
       "   0.08696645 -0.0190793  -0.04122177  0.05088576]\n",
       " [ 0.0769287   0.03099705  0.00856576 -0.04467198 -0.0692684   0.09132432\n",
       "   0.06786594 -0.06187843 -0.03436674  0.04234695]]\n",
       "<NDArray 2x10 @cpu(0)>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mxnet import nd\n",
    "from mxnet.gluon import nn\n",
    "\n",
    "x = nd.random.uniform(shape=(2, 20))\n",
    "\n",
    "net = nn.Sequential()\n",
    "net.add(nn.Dense(256, activation='relu'))\n",
    "net.add(nn.Dense(10))\n",
    "net.initialize()\n",
    "net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We used `nn.Sequential` constructor to generate an empty network into which we then inserted two layers.\n",
    "  - This really just constructs a block. \n",
    "  - These blocks can be combined into larger artifacts, often recursively.\n",
    "  ![](https://github.com/d2l-ai/d2l-en/raw/master/img/blocks.svg?sanitize=true)\n",
    "- A block behaves very much like a fancy layer\n",
    "  - It needs to ingest data (the input).\n",
    "  - It needs to produce a meaningful output. \n",
    "    - It allows us to invoke a block via net(X) to obtain the desired output. \n",
    "    - It invokes forward to perform forward propagation.\n",
    "  - It needs to produce a gradient with regard to its input when invoking backward. \n",
    "    - Typically this is automatic.\n",
    "  - It needs to store parameters that are inherent to the block. \n",
    "  - Obviously it also needs to initialize these parameters as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.1 A Custom Block\n",
    "- The `nn.Block` class \n",
    "  - It is a model constructor provided in the `nn` module, which we can inherit to define the model we want. \n",
    "- The following `MLP` class inherits the `Block` class to construct the multilayer perceptron\n",
    "  - It overrides the `__init__` and `forward` functions of the Block class. \n",
    "  - They are used to create model parameters and define forward computations, respectively. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mxnet import nd\n",
    "from mxnet.gluon import nn\n",
    "\n",
    "class MLP(nn.Block):\n",
    "    # Declare a layer with model parameters. \n",
    "    # Here, we declare two fully connected layers.\n",
    "    def __init__(self, **kwargs):\n",
    "        # Call the constructor of the MLP parent class Block to perform the necessary initialization. \n",
    "        # In this way, other function parameters can also be specified when constructing an instance, \n",
    "        # such as the model parameter, params, described in the following sections.\n",
    "        super(MLP, self).__init__(**kwargs)\n",
    "        self.hidden = nn.Dense(256, activation='relu')  # Hidden layer.\n",
    "        self.output = nn.Dense(10)  # Output layer.\n",
    "\n",
    "    # Define the forward computation of the model\n",
    "    # That is, how to return the required model output based on the input x.\n",
    "    def forward(self, x):\n",
    "        return self.output(self.hidden(x))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The `forward` method invokes a network simply by evaluating the hidden layer `self.hidden(x)` and subsequently by evaluating the output layer `self.output( ... )`. \n",
    "  - This is what we expect in the forward pass of this block.\n",
    "- The `__init__` method \n",
    "  - Define the layers. \n",
    "    - Initializes all of the Block-related parameters and then constructs the requisite layers. \n",
    "- There is no need to define a backpropagation method in the class. \n",
    "  - The system automatically generates the backward method\n",
    "- The same applies to the initialize method, which is generated automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.0036223   0.00633331  0.03201144 -0.01369375  0.10336448 -0.03508019\n",
       "  -0.00032164 -0.01676024  0.06978628  0.01303308]\n",
       " [ 0.03871716  0.02608212  0.03544959 -0.02521311  0.11005434 -0.01430662\n",
       "  -0.03052465 -0.03852826  0.06321152  0.0038594 ]]\n",
       "<NDArray 2x10 @cpu(0)>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net = MLP()\n",
    "net.initialize()\n",
    "net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The `block` class's subclass...\n",
    "  - it can be a layer (such as the `Dense` class provided by Gluon), \n",
    "  - it can be a model (such as the `MLP` class we just derived), \n",
    "  - it can be a part of a model (this is what typically happens when designing very deep networks). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.2 A Sequential Block\n",
    "- The purpose of the `Sequential` class is to provide some useful convenience functions. \n",
    "  - The `add` method allows us to add concatenated `Block` subclass instances one by one, \n",
    "  - The `forward` computation of the model is to compute these instances one by one in the order of addition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MySequential(nn.Block):\n",
    "    def __init__(self, **kwargs):\n",
    "        super(MySequential, self).__init__(**kwargs)\n",
    "\n",
    "    def add(self, block):\n",
    "        # Here, block is an instance of a Block subclass, and we assume it has a unique name. \n",
    "        # We save it in the member variable _children of the Block class, and its type is OrderedDict. \n",
    "        self._children[block.name] = block\n",
    "\n",
    "    def forward(self, x):\n",
    "        # OrderedDict guarantees that members will be traversed in the order they were added.\n",
    "        for block in self._children.values():\n",
    "            x = block(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- When `MySequential` instance calls the initialize function, the system automatically initializes all members of _children."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.07787763  0.00216402  0.016822    0.0305988  -0.00702019  0.01668715\n",
       "   0.04822846  0.0039432  -0.09300035 -0.04494302]\n",
       " [ 0.08891079 -0.00625484 -0.01619132  0.03807179 -0.01451489  0.02006173\n",
       "   0.0303478   0.02463485 -0.07605447 -0.04389168]]\n",
       "<NDArray 2x10 @cpu(0)>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net = MySequential()\n",
    "net.add(nn.Dense(256, activation='relu'))\n",
    "net.add(nn.Dense(10))\n",
    "net.initialize()\n",
    "net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.3 Blocks with Code\n",
    "- ***Constant*** model parameter\n",
    "  - These are parameters that are not used when invoking backprop. \n",
    "  $$f(\\mathbf{x},\\mathbf{w}) = 3 \\cdot \\mathbf{w}^\\top \\mathbf{x}.$$\n",
    "\n",
    "  - In this case 3 is a constant parameter. \n",
    "  - We could change 3 to something else, say $c$ via\n",
    "  $$f(\\mathbf{x},\\mathbf{w}) = c \\cdot \\mathbf{w}^\\top \\mathbf{x}.$$\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "class FancyMLP(nn.Block):\n",
    "    def __init__(self, **kwargs):\n",
    "        super(FancyMLP, self).__init__(**kwargs)\n",
    "        # Random weight parameters created with the get_constant are not iterated during training \n",
    "        # (i.e. constant parameters).\n",
    "        self.rand_weight = self.params.get_constant(\n",
    "            'rand_weight', \n",
    "            nd.random.uniform(shape=(20, 20))\n",
    "        )\n",
    "        self.dense = nn.Dense(20, activation='relu')\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.dense(x)\n",
    "        \n",
    "        # Use the constant parameters created, as well as the relu and dot functions of NDArray.\n",
    "        x = nd.relu(nd.dot(x, self.rand_weight.data()) + 1)\n",
    "\n",
    "        # Reuse the fully connected layer. \n",
    "        # This is equivalent to sharing parameters with two fully connected layers.\n",
    "        x = self.dense(x)\n",
    "        \n",
    "        # Here in Control flow, we need to call asscalar to return the scalar for comparison.\n",
    "        while x.norm().asscalar() > 1:\n",
    "            x /= 2\n",
    "            \n",
    "        if x.norm().asscalar() < 0.8:\n",
    "            x *= 10\n",
    "            \n",
    "        return x.sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- In this FancyMLP model, we used constant weight `rand_weight` (note that it is not a model parameter), performed a matrix multiplication operation (`nd.dot`), and reused the same Dense layer. \n",
    "- We used the same network twice.\n",
    "  - Two networks share the same parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[25.522684]\n",
       "<NDArray 1 @cpu(0)>"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net = FancyMLP()\n",
    "net.initialize()\n",
    "net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- The example below combines examples for building a block from individual blocks, which in turn, may be blocks themselves. \n",
    "- Furthermore, we can even combine multiple strategies inside the same forward function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[3.853818]\n",
       "<NDArray 1 @cpu(0)>"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class NestMLP(nn.Block):\n",
    "    def __init__(self, **kwargs):\n",
    "        super(NestMLP, self).__init__(**kwargs)\n",
    "        self.net = nn.Sequential()\n",
    "        self.net.add(\n",
    "            nn.Dense(64, activation='relu'),\n",
    "            nn.Dense(32, activation='relu')\n",
    "        )\n",
    "        self.dense = nn.Dense(16, activation='relu')\n",
    "\n",
    "    def forward(self, x):\n",
    "        return self.dense(self.net(x))\n",
    "\n",
    "chimera = nn.Sequential()\n",
    "chimera.add(\n",
    "    NestMLP(), \n",
    "    nn.Dense(20), \n",
    "    FancyMLP()\n",
    ")\n",
    "\n",
    "chimera.initialize()\n",
    "chimera(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.4 Compilation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We have lots of dictionary lookups, code execution, and lots of other Pythonic things going on in what is supposed to be a high performance deep learning library. \n",
    "- The problems of Python’s Global Interpreter Lock are well known. \n",
    "  - In the context of deep learning it means that we have a super fast GPU (or multiple of them) which might have to wait until a puny single CPU core running Python gets a chance to tell it what to do next. \n",
    "  - This is clearly awful and there are many ways around it. \n",
    "  - The best way to speed up Python is by avoiding it altogether.\n",
    "- Gluon does this by allowing for Hybridization. \n",
    "  - In it, the Python interpreter executes the block the first time it’s invoked. \n",
    "  - The Gluon runtime records what is happening and the next time around it short circuits any calls to Python. \n",
    "  - This can accelerate things considerably in some cases but care needs to be taken with control flow. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Parameter Management\n",
    "- Accessing parameters for debugging, diagnostics,to visualize them or to save them is the first step to understanding how to work with custom models.\n",
    "- Secondly, we want to set them in specific ways, e.g. for initialization purposes.\n",
    "  - We discuss the structure of parameter initializers.\n",
    "- Lastly, we show how this knowledge can be put to good use by building networks that share some parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 0.00407254  0.1019081   0.02062148  0.0552136   0.07915469 -0.05606864\n",
       "  -0.1041737   0.00337543 -0.06740113 -0.06313396]\n",
       " [ 0.01474816  0.0497599   0.00468814  0.0468959   0.06075    -0.07501648\n",
       "  -0.07173473  0.06645283 -0.08554209 -0.16031   ]]\n",
       "<NDArray 2x10 @cpu(0)>"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from mxnet import init, nd\n",
    "from mxnet.gluon import nn\n",
    "\n",
    "net = nn.Sequential()\n",
    "net.add(nn.Dense(256, activation='relu'))\n",
    "net.add(nn.Dense(10))\n",
    "\n",
    "net.initialize()  # Use the default initialization method.\n",
    "\n",
    "x = nd.random.uniform(shape=(2, 20))\n",
    "net(x)            # Forward computation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2.1 Parameter Access\n",
    "- In the case of a Sequential class we can access the parameters with ease, simply by indexing each of the layers in the network.\n",
    "- The names of the parameters sych as `dense17_weight` are very useful since they allow us to identify parameters uniquely even in a network of hundreds of layers and with nontrivial structure. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dense17_ (\n",
      "  Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "  Parameter dense17_bias (shape=(256,), dtype=float32)\n",
      ")\n",
      "dense18_ (\n",
      "  Parameter dense18_weight (shape=(10, 256), dtype=float32)\n",
      "  Parameter dense18_bias (shape=(10,), dtype=float32)\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "print(net[0].params)\n",
    "print(net[1].params)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "\n",
      "[[-0.05357582 -0.00228109 -0.03202471 ... -0.06692369 -0.00955358\n",
      "  -0.01753462]\n",
      " [ 0.01603388  0.02262501 -0.06019409 ... -0.03063859 -0.02505398\n",
      "   0.02994981]\n",
      " [-0.06580696  0.00862081  0.0332156  ...  0.05478401 -0.06591336\n",
      "  -0.06983094]\n",
      " ...\n",
      " [ 0.02946895  0.05579274  0.01646009 ...  0.04695714  0.0208929\n",
      "  -0.06849758]\n",
      " [ 0.01405259 -0.02814856  0.02697545 ... -0.03466139 -0.00090686\n",
      "   0.02379511]\n",
      " [-0.05085108 -0.0290781   0.04582401 ...  0.00601977 -0.00817193\n",
      "   0.06228926]]\n",
      "<NDArray 256x20 @cpu(0)>\n",
      "\n",
      "[[ 0.00338574  0.04148472 -0.01888602 ... -0.06870207 -0.06303862\n",
      "  -0.04540806]\n",
      " [ 0.02585206  0.05058105  0.00044364 ... -0.00163042 -0.04103333\n",
      "   0.06294077]\n",
      " [ 0.04751863  0.06542363 -0.03117647 ...  0.00775644  0.01028717\n",
      "   0.02544965]\n",
      " ...\n",
      " [-0.02485485  0.01089642  0.0489713  ...  0.02502301  0.03442856\n",
      "  -0.03999568]\n",
      " [ 0.02737013 -0.04429683  0.03048034 ...  0.00809494  0.00763652\n",
      "   0.05087072]\n",
      " [ 0.01182987 -0.06716982  0.01266196 ...  0.01583868 -0.00265694\n",
      "  -0.00011061]]\n",
      "<NDArray 10x256 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "print(net[0].weight)\n",
    "print(net[0].weight.data())\n",
    "print(net[1].weight.data())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameter dense17_bias (shape=(256,), dtype=float32)\n",
      "\n",
      "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "<NDArray 256 @cpu(0)>\n",
      "\n",
      "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "<NDArray 10 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "print(net[0].bias)\n",
    "print(net[0].bias.data())\n",
    "print(net[1].bias.data())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "\n",
      "[[-0.05357582 -0.00228109 -0.03202471 ... -0.06692369 -0.00955358\n",
      "  -0.01753462]\n",
      " [ 0.01603388  0.02262501 -0.06019409 ... -0.03063859 -0.02505398\n",
      "   0.02994981]\n",
      " [-0.06580696  0.00862081  0.0332156  ...  0.05478401 -0.06591336\n",
      "  -0.06983094]\n",
      " ...\n",
      " [ 0.02946895  0.05579274  0.01646009 ...  0.04695714  0.0208929\n",
      "  -0.06849758]\n",
      " [ 0.01405259 -0.02814856  0.02697545 ... -0.03466139 -0.00090686\n",
      "   0.02379511]\n",
      " [-0.05085108 -0.0290781   0.04582401 ...  0.00601977 -0.00817193\n",
      "   0.06228926]]\n",
      "<NDArray 256x20 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "print(net[0].params['dense17_weight'])\n",
    "print(net[0].params['dense17_weight'].data())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We can compute the gradient with respect to the parameters. \n",
    "  - It has the same shape as the weight. \n",
    "- However, since we did not invoke backpropagation yet, the values are all 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[0. 0. 0. ... 0. 0. 0.]\n",
       " [0. 0. 0. ... 0. 0. 0.]\n",
       " [0. 0. 0. ... 0. 0. 0.]\n",
       " ...\n",
       " [0. 0. 0. ... 0. 0. 0.]\n",
       " [0. 0. 0. ... 0. 0. 0.]\n",
       " [0. 0. 0. ... 0. 0. 0.]]\n",
       "<NDArray 256x20 @cpu(0)>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net[0].weight.grad()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- All Parameters at Once\n",
    "  - A method `collect_params` grabs all parameters of a network in one dictionary such that we can traverse it with ease. \n",
    "    - It does so by iterating over all constituents of a block and calls `collect_params` on subblocks as needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dense17_ (\n",
      "  Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "  Parameter dense17_bias (shape=(256,), dtype=float32)\n",
      ")\n",
      "sequential5_ (\n",
      "  Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "  Parameter dense17_bias (shape=(256,), dtype=float32)\n",
      "  Parameter dense18_weight (shape=(10, 256), dtype=float32)\n",
      "  Parameter dense18_bias (shape=(10,), dtype=float32)\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "# parameters only for the first layer \n",
    "print(net[0].collect_params())\n",
    "\n",
    "# parameters of the entire network\n",
    "print(net.collect_params())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
       "<NDArray 10 @cpu(0)>"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.collect_params()['dense18_bias'].data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Regular expressions to filter out the required parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sequential5_ (\n",
      "  Parameter dense17_weight (shape=(256, 20), dtype=float32)\n",
      "  Parameter dense18_weight (shape=(10, 256), dtype=float32)\n",
      ")\n",
      "sequential5_ (\n",
      "  Parameter dense17_bias (shape=(256,), dtype=float32)\n",
      "  Parameter dense18_bias (shape=(10,), dtype=float32)\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "print(net.collect_params('.*weight'))\n",
    "print(net.collect_params('.*bias'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Rube Goldberg strikes again\n",
    "  - Let’s see how the parameter naming conventions work if we nest multiple blocks inside each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[ 6.6884764e-09 -1.9991958e-08 -4.7974535e-09 -8.7700771e-09\n",
       "  -1.6186359e-08  1.0396601e-08  1.0741704e-08  6.3689147e-09\n",
       "  -1.9723858e-09  3.0433571e-09]\n",
       " [ 8.6247640e-09 -1.8395822e-08 -2.2687403e-09 -1.6464673e-08\n",
       "  -2.4844146e-08  1.4356444e-08  1.6593912e-08  6.3606223e-09\n",
       "  -9.6643706e-09  8.3527123e-09]]\n",
       "<NDArray 2x10 @cpu(0)>"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def block1():\n",
    "    net = nn.Sequential()\n",
    "    net.add(nn.Dense(32, activation='relu'))\n",
    "    net.add(nn.Dense(16, activation='relu'))\n",
    "    return net\n",
    "\n",
    "def block2():\n",
    "    net = nn.Sequential()\n",
    "    for i in range(4):\n",
    "        net.add(block1())\n",
    "    return net\n",
    "\n",
    "rgnet = nn.Sequential()\n",
    "rgnet.add(block2())\n",
    "rgnet.add(nn.Dense(10))\n",
    "rgnet.initialize()\n",
    "rgnet(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bound method Block.collect_params of Sequential(\n",
      "  (0): Sequential(\n",
      "    (0): Sequential(\n",
      "      (0): Dense(20 -> 32, Activation(relu))\n",
      "      (1): Dense(32 -> 16, Activation(relu))\n",
      "    )\n",
      "    (1): Sequential(\n",
      "      (0): Dense(16 -> 32, Activation(relu))\n",
      "      (1): Dense(32 -> 16, Activation(relu))\n",
      "    )\n",
      "    (2): Sequential(\n",
      "      (0): Dense(16 -> 32, Activation(relu))\n",
      "      (1): Dense(32 -> 16, Activation(relu))\n",
      "    )\n",
      "    (3): Sequential(\n",
      "      (0): Dense(16 -> 32, Activation(relu))\n",
      "      (1): Dense(32 -> 16, Activation(relu))\n",
      "    )\n",
      "  )\n",
      "  (1): Dense(16 -> 10, linear)\n",
      ")>\n",
      "sequential6_ (\n",
      "  Parameter dense19_weight (shape=(32, 20), dtype=float32)\n",
      "  Parameter dense19_bias (shape=(32,), dtype=float32)\n",
      "  Parameter dense20_weight (shape=(16, 32), dtype=float32)\n",
      "  Parameter dense20_bias (shape=(16,), dtype=float32)\n",
      "  Parameter dense21_weight (shape=(32, 16), dtype=float32)\n",
      "  Parameter dense21_bias (shape=(32,), dtype=float32)\n",
      "  Parameter dense22_weight (shape=(16, 32), dtype=float32)\n",
      "  Parameter dense22_bias (shape=(16,), dtype=float32)\n",
      "  Parameter dense23_weight (shape=(32, 16), dtype=float32)\n",
      "  Parameter dense23_bias (shape=(32,), dtype=float32)\n",
      "  Parameter dense24_weight (shape=(16, 32), dtype=float32)\n",
      "  Parameter dense24_bias (shape=(16,), dtype=float32)\n",
      "  Parameter dense25_weight (shape=(32, 16), dtype=float32)\n",
      "  Parameter dense25_bias (shape=(32,), dtype=float32)\n",
      "  Parameter dense26_weight (shape=(16, 32), dtype=float32)\n",
      "  Parameter dense26_bias (shape=(16,), dtype=float32)\n",
      "  Parameter dense27_weight (shape=(10, 16), dtype=float32)\n",
      "  Parameter dense27_bias (shape=(10,), dtype=float32)\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "print(rgnet.collect_params)\n",
    "print(rgnet.collect_params())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dense21_bias\n",
      "\n",
      "[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.\n",
      " 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      "<NDArray 32 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "print(rgnet[0][1][0].bias.name)\n",
    "print(rgnet[0][1][0].bias.data())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2.2 Parameter Initialization\n",
    "- By default, MXNet initializes the weight matrices uniformly by drawing from $U[-0.07, 0.07]$ and the bias parameters are all set to $0$.\n",
    "- MXNet’s init module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Built-in Initialization\n",
    "  - `force_reinit` ensures that the variables are initialized again, regardless of whether they were already initialized previously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[ 2.3467798e-02 -6.5989629e-03 -4.6144146e-04 -1.0800398e-03\n",
       " -2.5858415e-05 -6.9288602e-03  4.7301534e-03  1.6473899e-02\n",
       " -8.4304502e-03  3.8224545e-03  6.4377831e-03  9.0460032e-03\n",
       " -2.7124031e-04 -6.6581573e-03 -8.7738056e-03 -1.9149805e-03\n",
       "  4.9869940e-03  1.7430604e-02 -9.3654627e-03 -1.5981171e-03]\n",
       "<NDArray 20 @cpu(0)>"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)\n",
    "net[0].weight.data()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to `Constant(1)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]\n",
       "<NDArray 20 @cpu(0)>"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.initialize(init=init.Constant(1), force_reinit=True)\n",
    "net[0].weight.data()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We initialize the second layer to a constant value of 42 and we use the `Xavier` initializer for the weights of the first layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "[42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42. 42.\n",
      " 42. 42. 42. 42.]\n",
      "<NDArray 256 @cpu(0)>\n",
      "\n",
      "[ 0.08490172  0.13223866  0.01630534 -0.00707628 -0.03077595  0.14420772\n",
      "  0.13430956  0.07363294  0.02899179 -0.13734338 -0.11237526  0.08715159\n",
      " -0.02431636  0.12052891  0.0830339   0.06951596  0.05713288 -0.06902333\n",
      "  0.12277207 -0.10455534]\n",
      "<NDArray 20 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "net[1].initialize(init=init.Constant(42), force_reinit=True)\n",
    "net[0].weight.initialize(init=init.Xavier(), force_reinit=True)\n",
    "print(net[1].weight.data()[0])\n",
    "print(net[0].weight.data()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Custom Initialization\n",
    "  - Sometimes, the initialization methods we need are not provided in the init module. \n",
    "  - At this point, we can implement a subclass of the Initializer class so that we can use it like any other initialization method. \n",
    "  - Usually, we only need to implement the `_init_weight` function and modify the incoming `NDArray` according to the initial result. \n",
    "  - In the example below, we pick a decidedly bizarre and nontrivial distribution, just to prove the point. \n",
    "  - We draw the coefficients from the following distribution: $$ \\begin{aligned} w \\sim \\begin{cases} U[5, 10] & \\text{ with probability } \\frac{1}{4} \\\\ 0 & \\text{ with probability } \\frac{1}{2} \\\\ U[-10, -5] & \\text{ with probability } \\frac{1}{4} \\end{cases} \\end{aligned} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Init dense17_weight (256, 20)\n",
      "Init dense18_weight (10, 256)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "\n",
       "[-9.572826   7.9411488 -7.953664   0.        -0.        -7.483777\n",
       "  9.6598015 -5.8997717 -7.205085   8.736895  -0.        -0.\n",
       " -8.978939  -0.        -0.        -0.        -0.        -0.\n",
       "  8.936142  -0.       ]\n",
       "<NDArray 20 @cpu(0)>"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "class MyInit(init.Initializer):\n",
    "    def _init_weight(self, name, data):\n",
    "        print('Init', name, data.shape)\n",
    "        data[:] = nd.random.uniform(low=-10, high=10, shape=data.shape)\n",
    "        data *= data.abs() >= 5\n",
    "\n",
    "net.initialize(MyInit(), force_reinit=True)\n",
    "net[0].weight.data()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Since `data()` returns an `NDArray`, we can access it just like any other matrix. \n",
    "- If you want to adjust parameters within an `autograd` scope, you need to use `set_data` to avoid confusing the automatic differentiation mechanics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[42.         8.941149  -6.953664   1.         1.        -6.483777\n",
       " 10.6598015 -4.8997717 -6.205085   9.736895   1.         1.\n",
       " -7.978939   1.         1.         1.         1.         1.\n",
       "  9.936142   1.       ]\n",
       "<NDArray 20 @cpu(0)>"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net[0].weight.data()[:] += 1\n",
    "net[0].weight.data()[0,0] = 42\n",
    "net[0].weight.data()[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2.3 Tied Parameters\n",
    "- In some cases, we want to share model parameters across multiple layers. \n",
    "- In the following we allocate a dense layer and then use its parameters specifically to set those of another layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "[1. 1. 1. 1. 1. 1. 1. 1.]\n",
      "<NDArray 8 @cpu(0)>\n",
      "\n",
      "[-0.03439966 -0.05555296  0.0232332  -0.02662065  0.04434159 -0.05426525\n",
      "  0.01500529 -0.06945959]\n",
      "<NDArray 8 @cpu(0)>\n",
      "\n",
      "[-0.03439966 -0.05555296  0.0232332  -0.02662065  0.04434159 -0.05426525\n",
      "  0.01500529 -0.06945959]\n",
      "<NDArray 8 @cpu(0)>\n",
      "\n",
      "[1. 1. 1. 1. 1. 1. 1. 1.]\n",
      "<NDArray 8 @cpu(0)>\n"
     ]
    }
   ],
   "source": [
    "net = nn.Sequential()\n",
    "# we need to give the shared layer a name such that we can reference its parameters\n",
    "shared = nn.Dense(8, activation='relu')\n",
    "net.add(\n",
    "    nn.Dense(8, activation='relu'),\n",
    "    shared,\n",
    "    nn.Dense(8, activation='relu', params=shared.params),\n",
    "    nn.Dense(10)\n",
    ")\n",
    "net.initialize()\n",
    "\n",
    "x = nd.random.uniform(shape=(2, 20))\n",
    "net(x)\n",
    "\n",
    "# Check whether the parameters are the same\n",
    "print(net[1].weight.data()[0] == net[2].weight.data()[0])\n",
    "print(net[1].weight.data()[0])\n",
    "print(net[2].weight.data()[0])\n",
    "\n",
    "# And make sure that they're actually the same object rather than just having the same value.\n",
    "net[1].weight.data()[0,0] = 100\n",
    "print(net[1].weight.data()[0] == net[2].weight.data()[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4.3 Deferred Initialization\n",
    "- In the previous examples...\n",
    "  - We defined the network architecture ***with no regard to the input dimensionality***.\n",
    "  - We added layers ***without regard to the output dimension of the previous layer***.\n",
    "  - We even ‘initialized’ these parameters ***without knowing how many parameters were to initialize***.\n",
    "- The ability to set parameters without the need to know what the dimensionality is can greatly simplify statistical modeling. \n",
    "- In what follows, we will discuss how this works using initialization as an example. \n",
    "- After all, we cannot initialize variables that we don’t know exist."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3.1 Instantiating a Network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mxnet import init, nd\n",
    "from mxnet.gluon import nn\n",
    "\n",
    "def getnet():\n",
    "    net = nn.Sequential()\n",
    "    net.add(nn.Dense(256, activation='relu'))\n",
    "    net.add(nn.Dense(10))\n",
    "    return net\n",
    "\n",
    "net = getnet()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- At this point, each layer needs weights and bias, albeit of unspecified dimensionality. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<bound method Block.collect_params of Sequential(\n",
      "  (0): Dense(None -> 256, Activation(relu))\n",
      "  (1): Dense(None -> 10, linear)\n",
      ")>\n",
      "sequential18_ (\n",
      "  Parameter dense52_weight (shape=(256, 0), dtype=float32)\n",
      "  Parameter dense52_bias (shape=(256,), dtype=float32)\n",
      "  Parameter dense53_weight (shape=(10, 0), dtype=float32)\n",
      "  Parameter dense53_bias (shape=(10,), dtype=float32)\n",
      ")\n"
     ]
    }
   ],
   "source": [
    "print(net.collect_params)\n",
    "print(net.collect_params())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Trying to access `net[0].weight.data()` at this point would trigger a runtime error stating that the network needs initializing before it can do anything."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "ename": "RuntimeError",
     "evalue": "Parameter 'dense52_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mRuntimeError\u001b[0m                              Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-46-59ea5453a5fc>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnet\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mweight\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py\u001b[0m in \u001b[0;36mdata\u001b[0;34m(self, ctx)\u001b[0m\n\u001b[1;32m    391\u001b[0m         \u001b[0mNDArray\u001b[0m \u001b[0mon\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    392\u001b[0m         \"\"\"\n\u001b[0;32m--> 393\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_check_and_get\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    394\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    395\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mlist_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py\u001b[0m in \u001b[0;36m_check_and_get\u001b[0;34m(self, arr_list, ctx)\u001b[0m\n\u001b[1;32m    187\u001b[0m             \u001b[0;34m\"with Block.collect_params() instead of Block.params \"\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    188\u001b[0m             \u001b[0;34m\"because the later does not include Parameters of \"\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 189\u001b[0;31m             \"nested child Blocks\"%(self.name))\n\u001b[0m\u001b[1;32m    190\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    191\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_load_init\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mRuntimeError\u001b[0m: Parameter 'dense52_weight' has not been initialized. Note that you should initialize parameters and create Trainer with Block.collect_params() instead of Block.params because the later does not include Parameters of nested child Blocks"
     ]
    }
   ],
   "source": [
    "net[0].weight.data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sequential18_ (\n",
       "  Parameter dense52_weight (shape=(256, 0), dtype=float32)\n",
       "  Parameter dense52_bias (shape=(256,), dtype=float32)\n",
       "  Parameter dense53_weight (shape=(10, 0), dtype=float32)\n",
       "  Parameter dense53_bias (shape=(10,), dtype=float32)\n",
       ")"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net.initialize()\n",
    "net.collect_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Nothing really changed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "ename": "DeferredInitializationError",
     "evalue": "Parameter 'dense52_weight' has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters. You can also avoid deferred initialization by specifying in_units, num_features, etc., for network layers.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mDeferredInitializationError\u001b[0m               Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-49-59ea5453a5fc>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnet\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mweight\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py\u001b[0m in \u001b[0;36mdata\u001b[0;34m(self, ctx)\u001b[0m\n\u001b[1;32m    391\u001b[0m         \u001b[0mNDArray\u001b[0m \u001b[0mon\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    392\u001b[0m         \"\"\"\n\u001b[0;32m--> 393\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_check_and_get\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_data\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mctx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    394\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    395\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mlist_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/anaconda3/envs/gluon/lib/python3.6/site-packages/mxnet/gluon/parameter.py\u001b[0m in \u001b[0;36m_check_and_get\u001b[0;34m(self, arr_list, ctx)\u001b[0m\n\u001b[1;32m    181\u001b[0m                 \u001b[0;34m\"Please pass one batch of data through the network before accessing Parameters. \"\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    182\u001b[0m                 \u001b[0;34m\"You can also avoid deferred initialization by specifying in_units, \"\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 183\u001b[0;31m                 \"num_features, etc., for network layers.\"%(self.name))\n\u001b[0m\u001b[1;32m    184\u001b[0m         raise RuntimeError(\n\u001b[1;32m    185\u001b[0m             \u001b[0;34m\"Parameter '%s' has not been initialized. Note that \"\u001b[0m\u001b[0;31m \u001b[0m\u001b[0;31m\\\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mDeferredInitializationError\u001b[0m: Parameter 'dense52_weight' has not been initialized yet because initialization was deferred. Actual initialization happens during the first forward pass. Please pass one batch of data through the network before accessing Parameters. You can also avoid deferred initialization by specifying in_units, num_features, etc., for network layers."
     ]
    }
   ],
   "source": [
    "net[0].weight.data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Only once we provide the network with some data, we see a difference. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sequential18_ (\n",
       "  Parameter dense52_weight (shape=(256, 20), dtype=float32)\n",
       "  Parameter dense52_bias (shape=(256,), dtype=float32)\n",
       "  Parameter dense53_weight (shape=(10, 256), dtype=float32)\n",
       "  Parameter dense53_bias (shape=(10,), dtype=float32)\n",
       ")"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "x = nd.random.uniform(shape=(2, 20))\n",
    "net(x) # Forward computation.\n",
    "\n",
    "net.collect_params()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\n",
       "[[-0.05247737 -0.01900016  0.06498937 ...  0.02672191 -0.02730501\n",
       "   0.03611466]\n",
       " [ 0.0618015   0.03916474 -0.05941451 ...  0.04577643 -0.0453134\n",
       "  -0.04038748]\n",
       " [ 0.06184389  0.04633274  0.03094608 ...  0.00510379  0.05605743\n",
       "  -0.05085221]\n",
       " ...\n",
       " [-0.06550431  0.04614966  0.04391201 ... -0.01563684  0.04479967\n",
       "   0.06039421]\n",
       " [-0.06207634  0.00493836 -0.0689486  ...  0.02575751 -0.05235828\n",
       "   0.05903549]\n",
       " [-0.01011717  0.01382479  0.02665275 ... -0.05540304 -0.02307985\n",
       "   0.00403536]]\n",
       "<NDArray 256x20 @cpu(0)>"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "net[0].weight.data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3.2 Deferred Initialization in Practice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyInit(init.Initializer):\n",
    "    def _init_weight(self, name, data):\n",
    "        print('Init', name, data.shape)\n",
    "        # The actual initialization logic is omitted here.\n",
    "\n",
    "net = getnet()\n",
    "net.initialize(init=MyInit())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Init dense54_weight (256, 20)\n",
      "Init dense55_weight (10, 256)\n"
     ]
    }
   ],
   "source": [
    "x = nd.random.uniform(shape=(2, 20))\n",
    "y = net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- When performing a forward calculation based on the input `x`, the system can automatically infer the shape of the weight parameters of all layers based on the shape of the input. \n",
    "- Once the system has created these parameters, it calls the `MyInit` instance to initialize them before proceeding to the forward calculation.\n",
    "- This initialization will only be called when completing the initial forward calculation. \n",
    "- After that, we will not re-initialize when we run the forward calculation net(x)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = net(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3.3 Forced Initialization\n",
    "- Deferred initialization does not occur if the system knows the shape of all parameters when calling the initialize function. This can occur in two cases:\n",
    "  - We’ve already seen some data and we just want to reset the parameters.\n",
    "  - We specificed all input and output dimensions of the network when defining it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Init dense54_weight (256, 20)\n",
      "Init dense55_weight (10, 256)\n"
     ]
    }
   ],
   "source": [
    "net.initialize(init=MyInit(), force_reinit=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- We specify the `in_units` so that initialization can occur immediately once initialize is called"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Init dense56_weight (256, 20)\n",
      "Init dense57_weight (10, 256)\n"
     ]
    }
   ],
   "source": [
    "net = nn.Sequential()\n",
    "net.add(nn.Dense(256, in_units=20, activation='relu'))\n",
    "net.add(nn.Dense(10, in_units=256))\n",
    "net.initialize(init=MyInit())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}