{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Kseniia Terekhova, ODS Slack: Kseniia\n", " \n", "##
Tutorial\n", "##
A little more info about NumPy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Introduction/justification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Though NumPy was not listed as a prerequisite for mlcourse.ai, there is no doubt that most participants are familiar with it and have no difficulty performing common actions. However, some interesting details encountered here and there seem worth sharing. No one knows everything, and NumPy was not covered in the course in detail - yet it is a powerful scientific library that can make many mathematical calculations simpler and nicer.
\n", "Links to the materials this tutorial is based on can be found at the end of the notebook, in the \"References\" section. And of course, I am not going to retell the NumPy quickstart tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. NumPy performance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not only the convenient API that makes NumPy so useful for scientific purposes, but also its performance characteristics. Python is not the quickest or most memory-efficient language, and when you often get a MemoryError while working with large ML datasets, that does not look like a minor disadvantage.
\n", "Let's first compare the number of bytes taken by a standard Python list and an identical NumPy array. " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python list size: 8.583 MB\n", "Numpy array size: 7.629 MB\n", "ArraySize/ListSize ratio: 0.89\n" ] } ], "source": [ "mb = 1024 * 1024\n", "\n", "python_list = list(range(0, 1000000))\n", "numpy_array = np.array(range(0, 1000000))\n", "\n", "print(\"Python list size: {0:.3f} MB\".format(sys.getsizeof(python_list) / mb))\n", "print(\"Numpy array size: {0:.3f} MB\".format(numpy_array.nbytes / mb))\n", "print(\"ArraySize/ListSize ratio: {0:.2f}\".format(numpy_array.nbytes / sys.getsizeof(python_list)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An 11% gain is already noticeable - and note that sys.getsizeof counts only the list object with its pointers, not the integer objects those pointers refer to, so the real difference is even larger. But were Python lists implemented so inefficiently? No, they were just implemented differently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
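" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Memory is only half of the story: vectorized NumPy operations also run much faster than explicit Python-level loops, because the looping happens in compiled code. A rough sketch (absolute timings will vary by machine):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import time\n", "\n", "# Sum a million numbers: built-in sum over the list vs. the vectorized array sum\n", "start = time.perf_counter()\n", "total = sum(python_list)\n", "list_time = time.perf_counter() - start\n", "\n", "start = time.perf_counter()\n", "total = numpy_array.sum()\n", "array_time = time.perf_counter() - start\n", "\n", "print(\"list sum:  {0:.5f} s\".format(list_time))\n", "print(\"array sum: {0:.5f} s\".format(array_time))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "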
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While NumPy stores data in a contiguous area of memory, a Python list stores only pointers to the real data. And yes, not only is \"a list in Python more than just a list\", but also \"an integer in Python is more than just an integer\". " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
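" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is easy to check: a single Python int carries a full object header, while a NumPy element is just raw data. A small sketch (exact sizes depend on the platform; the values in the comments are typical for 64-bit CPython):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Size of one Python int object vs. one element of the NumPy array\n", "print(sys.getsizeof(1))  # ~28 bytes: object header plus the value\n", "print(numpy_array.itemsize)  # ~8 bytes: just the value" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "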

\n", "

Broadcasting in NumPy is a set of rules that allows two arrays with different dimensions to be virtually expanded to match each other's shapes, in order to perform element-by-element operations. Importantly, this set of rules is a \"virtual\" mechanism that only describes how the arrays will interact; no real expansion or memory allocation is performed.
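\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \"no memory is allocated\" part can be checked with np.broadcast_to, which returns a read-only view of the virtually expanded array (a minimal sketch):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "small = np.arange(3)\n", "virtual = np.broadcast_to(small, (1000000, 3))  # \"expanded\" to a million rows\n", "\n", "print(virtual.shape)  # (1000000, 3)\n", "# The stride along the new axis is 0: all rows reuse the same 3-element buffer\n", "print(virtual.strides)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "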

\n", "

The simplest case is adding a scalar number to a matrix; the scalar is added to each element of the matrix:

\n", "Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its leading (left) side.
\n", "Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
\n", "Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, adding a two-dimensional array to a one-dimensional array is performed this way:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "M: (2, 3)\n", "[[1. 1. 1.]\n", " [1. 1. 1.]]\n", "\n", "a: (3,)\n", "[0 1 2]\n" ] } ], "source": [ "M = np.ones((2, 3))\n", "a = np.arange(3)\n", "print(\"M:\", M.shape)\n", "print(M)\n", "print()\n", "print(\"a:\", a.shape)\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape of a is padded on the left, as it has fewer dimensions:
\n", " M.shape -> (2, 3)
\n", " a.shape -> (1, 3)

\n", "Then the 1-sized dimension of a is stretched to match M:
\n", " M.shape -> (2, 3)
\n", " a.shape -> (2, 3)
\n", "\n", "Now the shapes match and the matrices can be summed:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1., 2., 3.],\n", " [1., 2., 3.]])" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M + a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's what happens when both arrays need to be broadcast:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a: (3, 1)\n", "[[0]\n", " [1]\n", " [2]]\n", "\n", "b: (3,)\n", "[0 1 2]\n" ] } ], "source": [ "a = np.arange(3).reshape((3, 1))\n", "b = np.arange(3)\n", "\n", "print(\"a:\", a.shape)\n", "print(a)\n", "print()\n", "print(\"b:\", b.shape)\n", "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shape of b is padded on the left with ones:
\n", "a.shape -> (3, 1)
\n", "b.shape -> (1, 3)
\n", "
\n", "Then both arrays have a 1-sized dimension to be stretched:
\n", "a.shape -> (3, 3)
\n", "b.shape -> (3, 3)
\n", "
\n", "Then the matrices can be easily summed:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0, 1, 2],\n", " [1, 2, 3],\n", " [2, 3, 4]])" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, broadcasting is not always possible:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "M: (3, 2)\n", "[[1. 1.]\n", " [1. 1.]\n", " [1. 1.]]\n", "\n", "a: (3,)\n", "[0 1 2]\n", "\n" ] } ], "source": [ "M = np.ones((3, 2))\n", "print(\"M:\", M.shape)\n", "print(M)\n", "print()\n", "\n", "a = np.arange(3)\n", "print(\"a:\", a.shape)\n", "print(a)\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "M.shape -> (3, 2)
\n", "a.shape -> (1, 3)
\n", "
\n", "M.shape -> (3, 2)
\n", "a.shape -> (3, 3)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The matrices' shapes do not match, so an error is raised when trying to perform operations on them:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "operands could not be broadcast together with shapes (3,2) (3,) ", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mM\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0ma\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: operands could not be broadcast together with shapes (3,2) (3,) " ] } ], "source": [ "M + a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A possible solution in such a situation is to manually pad \"a\" with a 1-sized dimension on the right. This way Rule 1 is skipped, and according to Rule 2 NumPy will just stretch the array to the needed size: " ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a: (3, 1)\n", "[[0]\n", " [1]\n", " [2]]\n", "\n" ] } ], "source": [ "a = a[:, np.newaxis]\n", "\n", "print(\"a:\", a.shape)\n", "print(a)\n", "print()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1., 1.],\n", " [2., 2.],\n", " [3., 3.]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M + a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. What is np.newaxis?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example in the previous section, the np.newaxis constant was used. What is it? Actually, it is None." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.newaxis is None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, just a convenient alias. The np.newaxis constant is useful when converting a 1D array into a row vector or a column vector, by adding a new dimension on the left or right side:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3,)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "arr = np.arange(3)\n", "arr.shape" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 3)\n", "[[0 1 2]]\n" ] } ], "source": [ "row_vec = arr[np.newaxis, :]\n", "print(row_vec.shape)\n", "print(row_vec)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3, 1)\n", "[[0]\n", " [1]\n", " [2]]\n" ] } ], "source": [ "col_vec = arr[:, np.newaxis]\n", "print(col_vec.shape)\n", "print(col_vec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, it is similar to arr.reshape(-1, 1) and arr.reshape(1, -1), down to minor implementation details. 
But np.newaxis allows stacking dimensions using slice syntax, without specifying the original shape: " ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "M = np.ones((5, 5))" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 5, 5, 1, 1)" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M[np.newaxis, :, :, np.newaxis, np.newaxis].shape" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 5, 5, 1, 1)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M[np.newaxis, ..., np.newaxis, np.newaxis].shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And reshape allows using -1 only once, requiring the original shape to be passed explicitly when working with multidimensional arrays:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "can only specify one unknown dimension", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mM\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m \u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mValueError\u001b[0m: can only specify one unknown dimension" ] } ], "source": [ "M.reshape(1, -1, -1, 1, 1 ).shape" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "This will work:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 5, 5, 1, 1)" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M.reshape(1, *M.shape, 1, 1 ).shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "but doesn't it look a little bit clumsy?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Adding several new dimensions is useful in ML when working with, for example, convolutional neural networks. Frameworks such as PyTorch allow initializing their \"Tensors\" from NumPy arrays, but often require input in the form \"minibatch × in_channels × iW\" or \"minibatch × in_channels × iH × iW\" (torch.nn.functional). There, minibatch and in_channels can be equal to 1, but they must be present. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. A quick note about matrix multiplication" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, summing is not the only operation that can be applied to matrices; a number of arithmetic operations can be used, along with several “universal functions”. The operation I would like to pay some attention to is multiplication, i.e. '*'. 
Like other arithmetic operations, in NumPy it is applied elementwise:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A:\n", "[[2 2]\n", " [2 2]]\n", "\n", "B:\n", "[[3 3]\n", " [3 3]]\n", "\n", "A*B:\n", "[[6 6]\n", " [6 6]]\n" ] } ], "source": [ "A = np.full((2, 2), 2)\n", "print(\"A:\")\n", "print(A)\n", "print()\n", "\n", "B = np.full((2, 2), 3)\n", "print(\"B:\")\n", "print(B)\n", "print()\n", "\n", "print(\"A*B:\")\n", "print(A*B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But when speaking about matrix multiplication, especially in linear algebra (and ML), another operation is often implied: the matrix product, which is defined this way (formula from Wikipedia):" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$A = \\begin{pmatrix} a_{11} & a_{12} & \\cdots & a_{1m}\\\\a_{21} & a_{22} & \\cdots & a_{2m} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ a_{n1} & a_{n2} & \\cdots & a_{nm} \\end{pmatrix}, B = \\begin{pmatrix} b_{11} & b_{12} & \\cdots & b_{1p}\\\\b_{21} & b_{22} & \\cdots & b_{2p} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ b_{m1} & b_{m2} & \\cdots & b_{mp} \\end{pmatrix}$$\n", "Matrix product C = AB:\n", "$$C = \\begin{pmatrix} c_{11} & c_{12} & \\cdots & c_{1p} \\\\c_{21} & c_{22} & \\cdots & c_{2p} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ c_{n1} & c_{n2} & \\cdots & c_{np} \\end{pmatrix}$$ \n", "$c_{ij} = a_{i1}b_{1j} + \\cdots + a_{im}b_{mj} = \\sum_{k=1}^{m} a_{ik}b_{kj}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The numpy.dot() function or the \"@\" operator is used for this purpose in NumPy. 
It is unpleasant to confuse these two operations, especially since matrix broadcasting exists:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A:\n", "[[1 2 3]\n", " [4 5 6]\n", " [7 8 9]]\n", "\n", "B:\n", "[3 3 3]\n", "\n", "A*B:\n", "[[ 3 6 9]\n", " [12 15 18]\n", " [21 24 27]]\n", "\n", "A@B:\n", "[18 45 72]\n" ] } ], "source": [ "A = np.arange(1, 10).reshape(3,3)\n", "print(\"A:\")\n", "print(A)\n", "print()\n", "\n", "B = np.full((3,), 3)\n", "print(\"B:\")\n", "print(B)\n", "print()\n", "\n", "print(\"A*B:\")\n", "print(A*B)\n", "\n", "print()\n", "print(\"A@B:\")\n", "print(A@B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, be attentive :)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. No mess with np.meshgrid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We used the np.meshgrid function somewhere in the course, but without any explanation. That's a pity in my opinion, as it is not so easy to grasp what it does from the NumPy documentation:\n", "
\n", "numpy.meshgrid(*xi, **kwargs)
\n", "\n", "Return coordinate matrices from coordinate vectors.
\n", "\n", "Make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2,…, xn.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nevertheless, the first time I saw it used was in an article about Spatial Transformer Networks, at the step with \"Identity meshgrid\" and \"Transformed meshgrid\". So, it can be a useful thing.
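\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In short, given the coordinates along each axis, np.meshgrid returns matrices that hold, for every point of the grid, its coordinate along that axis. A small illustration:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.arange(3)\n", "y = np.arange(2)\n", "\n", "xx, yy = np.meshgrid(x, y)\n", "\n", "print(xx)  # x-coordinate of every grid point, shape (2, 3)\n", "print(yy)  # y-coordinate of every grid point, shape (2, 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "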