{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"code_folding": [
1
],
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Lecture 3.1: Matrix multiplication part 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Syllabus\n",
"**Week 1:** Intro week, floating point, vector norms, matrix multiplication\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Recap of the previous lecture\n",
"- Concept of floating point\n",
"- Basic vector norms ($p$-norm, Manhattan distance, 2-norm, Chebyshev norm)\n",
"- A short demo on $L_1$-norm minimization\n",
"- Concept of forward/backward error"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Today's lecture\n",
"Today we will talk about:\n",
"- Matrices\n",
"- Matrix multiplication\n",
"- Matrix norms, operator norms\n",
"- Unitary matrices, unitarily invariant norms\n",
"- The concept of block algorithms for NLA: why and how\n",
"- Complexity of matrix multiplication"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Matrix-by-matrix product\n",
"\n",
"Consider composition of two linear operators:\n",
"\n",
"1. $y = Bx$\n",
"2. $z = Ay$\n",
"\n",
"Then, $z = Ay = A B x = C x$, where $C$ is the **matrix-by-matrix product**.\n",
"\n",
"A product of an $n \\times k$ matrix $A$ and a $k \\times m$ matrix $B$ is an $n \\times m$ matrix $C$ with elements\n",
"$$\n",
" c_{ij} = \\sum_{s=1}^k a_{is} b_{sj}, \\quad i = 1, \\ldots, n, \\quad j = 1, \\ldots, m \n",
"$$"
]
},
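{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of this definition in NumPy: compute one element $c_{ij}$ directly from the sum and compare it with `np.dot` (the matrix sizes below are arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"n, k, m = 3, 4, 5\n",
"a = np.random.randn(n, k)\n",
"b = np.random.randn(k, m)\n",
"c = a.dot(b)\n",
"#Element (0, 0) from the definition: sum over s of a[0, s] * b[s, 0]\n",
"c00 = sum(a[0, s] * b[s, 0] for s in range(k))\n",
"print(np.allclose(c00, c[0, 0]))"
]
},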
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Complexity of MM\n",
"The complexity of the naive algorithm for MM of two $n \\times n$ matrices is $\\mathcal{O}(n^3)$. \n",
"\n",
"Matrix-by-matrix product is the **core** of almost all efficient algorithms in linear algebra. \n",
"\n",
"Basically, all NLA algorithms are reduced to a sequence of matrix-by-matrix products, \n",
"\n",
"so an efficient implementation of MM speeds up these numerical algorithms by the same factor. \n",
"\n",
"However, implementing MM efficiently is not easy at all!"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Efficient implementation of MM\n",
"Is it easy to multiply a matrix by a matrix? \n",
"\n",
"The answer is: **no**, not if you want it to be as fast as possible \n",
"\n",
"on the computers at hand."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Demo\n",
"Let us do a short demo and compare the `np.dot()` routine (which in my case uses MKL) with a hand-written matrix-by-matrix routine in pure Python and with a compiled version of it (Numba below; the Cython magic is loaded for a similar experiment)."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"code_folding": [],
"collapsed": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"def matmul(a, b):\n",
"    n = a.shape[0]\n",
"    k = a.shape[1]\n",
"    m = b.shape[1]\n",
"    c = np.zeros((n, m))\n",
"    #Transposing b first would cost only O(n^2) extra operations\n",
"    for i in range(n): #The triple loop costs O(n^3) operations\n",
"        for j in range(m):\n",
"            for s in range(k):\n",
"                c[i, j] += a[i, s] * b[s, j]\n",
"    return c"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"%reload_ext cythonmagic"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"from numba import jit\n",
"@jit\n",
"def numba_matmul(a, b):\n",
" n = a.shape[0]\n",
" k = a.shape[1]\n",
" m = b.shape[1]\n",
" c = np.zeros((n, m))\n",
" for i in range(n):\n",
" for j in range(m):\n",
" for s in range(k):\n",
" c[i, j] += a[i, s] * b[s, j]\n",
" return c"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Then we just compare computational times.\n",
"\n",
"Guess the answer."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 294.61 times longer than the fastest. This could mean that an intermediate result is being cached \n",
"1 loops, best of 3: 952 µs per loop\n",
"10000 loops, best of 3: 112 µs per loop\n"
]
}
],
"source": [
"n = 100\n",
"a = np.random.randn(n, n)\n",
"b = np.random.randn(n, n)\n",
"%timeit c = numba_matmul(a, b)\n",
"#%timeit cf = cython_matmul(a, b)\n",
"%timeit c = np.dot(a, b)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Why is it so? \n",
"There are two important issues:\n",
"\n",
"- Computers are more and more parallel (multicore CPUs, graphics processing units)\n",
"- The memory pyramid: there is a whole hierarchy of memory levels"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Memory architecture\n",
"\n",
"Fast memory is small, bigger memory is slow. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"- Data fits into the fast memory: \n",
" load all data, compute\n",
"- Data does not fit into the fast memory: \n",
" load data by chunks, compute, load again"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"We need to reduce the number of read/write operations! \n",
"\n",
"This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## BLAS\n",
"The Basic Linear Algebra Subprograms (**BLAS**) have three levels:\n",
"1. BLAS-1: vector operations like $c = a + b$\n",
"2. BLAS-2: operations like the matrix-by-vector product\n",
"3. BLAS-3: the matrix-by-matrix product\n",
"\n",
"What is the principal difference between them?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"The main difference is the ratio of the number of operations to the amount of input data!\n",
"\n",
"1. BLAS-1: $\\mathcal{O}(n)$ data, $\\mathcal{O}(n)$ operations\n",
"2. BLAS-2: $\\mathcal{O}(n^2)$ data, $\\mathcal{O}(n^2)$ operations\n",
"3. BLAS-3: $\\mathcal{O}(n^2)$ data, $\\mathcal{O}(n^3)$ operations"
]
},
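{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a sketch, the three levels map onto familiar NumPy calls (the underlying BLAS routines are `axpy`, `gemv` and `gemm`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"n = 300\n",
"x = np.random.randn(n)\n",
"y = np.random.randn(n)\n",
"A = np.random.randn(n, n)\n",
"B = np.random.randn(n, n)\n",
"z = x + y     #BLAS-1: O(n) data, O(n) operations\n",
"w = A.dot(x)  #BLAS-2: O(n^2) data, O(n^2) operations\n",
"C = A.dot(B)  #BLAS-3: O(n^2) data, O(n^3) operations"
]
},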
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"**Remark**: the quest for an $\\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not over yet.\n",
"\n",
"Strassen's algorithm gives $\\mathcal{O}(n^{2.807...})$ \n",
"\n",
"World record: $\\mathcal{O}(n^{2.37})$ [Reference](http://arxiv.org/pdf/1401.7714v1.pdf) \n",
"\n",
"The constant is unfortunately too big to make it practical!"
]
},
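{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As an illustration, one recursion level of Strassen's scheme replaces the 8 block multiplications by 7. Below is a minimal sketch for even-sized square matrices; a practical implementation recurses and switches to ordinary MM below some threshold:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"def strassen_one_level(A, B):\n",
"    #One level of Strassen's scheme: 7 block multiplications instead of 8\n",
"    n = A.shape[0] // 2\n",
"    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]\n",
"    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]\n",
"    M1 = (A11 + A22).dot(B11 + B22)\n",
"    M2 = (A21 + A22).dot(B11)\n",
"    M3 = A11.dot(B12 - B22)\n",
"    M4 = A22.dot(B21 - B11)\n",
"    M5 = (A11 + A12).dot(B22)\n",
"    M6 = (A21 - A11).dot(B11 + B12)\n",
"    M7 = (A12 - A22).dot(B21 + B22)\n",
"    C = np.zeros_like(A)\n",
"    C[:n, :n] = M1 + M4 - M5 + M7\n",
"    C[:n, n:] = M3 + M5\n",
"    C[n:, :n] = M2 + M4\n",
"    C[n:, n:] = M1 - M2 + M3 + M6\n",
"    return C\n",
"A = np.random.randn(64, 64)\n",
"B = np.random.randn(64, 64)\n",
"print(np.allclose(strassen_one_level(A, B), A.dot(B)))"
]
},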
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Memory hierarchy\n",
"How can we use the memory hierarchy? "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Break the matrix into blocks! ($2 \\times 2$ is an **illustration**) \n",
"\n",
"$\n",
" A = \\begin{bmatrix}\n",
" A_{11} & A_{12} \\\\\n",
" A_{21} & A_{22}\n",
" \\end{bmatrix}$, $B = \\begin{bmatrix}\n",
" B_{11} & B_{12} \\\\\n",
" B_{21} & B_{22}\n",
" \\end{bmatrix}$"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Then, \n",
"\n",
"$AB$ = $\\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\\\\n",
" A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\\end{bmatrix}.$\n"
]
},
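{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The block formula above can be sketched directly in NumPy. The block size `bs` below is an assumption for illustration; a real BLAS implementation chooses it from the cache sizes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"def block_matmul(a, b, bs=64):\n",
"    #Accumulate products of bs-by-bs blocks;\n",
"    #each block is reused many times while it sits in fast memory\n",
"    n, k = a.shape\n",
"    m = b.shape[1]\n",
"    c = np.zeros((n, m))\n",
"    for i in range(0, n, bs):\n",
"        for j in range(0, m, bs):\n",
"            for s in range(0, k, bs):\n",
"                c[i:i+bs, j:j+bs] += a[i:i+bs, s:s+bs].dot(b[s:s+bs, j:j+bs])\n",
"    return c\n",
"a = np.random.randn(200, 300)\n",
"b = np.random.randn(300, 150)\n",
"print(np.allclose(block_matmul(a, b), a.dot(b)))"
]
},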
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 Kb for the [Haswell Intel Chip](http://en.wikipedia.org/wiki/List_of_Intel_Core_i7_microprocessors#.22Haswell-H.22_.28MCP.2C_quad-core.2C_22_nm.29)), then we load them only once into the memory. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Key point\n",
"The number of reads/writes is reduced by a factor of $\\sqrt{M}$, where $M$ is the cache size. \n",
"\n",
"- We have to do linear algebra in terms of blocks! \n",
"- So you cannot even do Gaussian elimination as usual (or you suffer a ~10x performance loss)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Parallelization\n",
"Blocking also has a deep connection with parallel computations. \n",
"Consider adding two vectors:\n",
"$$ c = a + b$$\n",
"and suppose we have two processors. \n",
"\n",
"How fast can we go? \n",
"\n",
"Of course, not more than twice as fast."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100 loops, best of 3: 2.82 ms per loop\n",
"100 loops, best of 3: 2.82 ms per loop\n"
]
}
],
"source": [
"## This demo requires Anaconda distribution to be installed\n",
"import mkl\n",
"import numpy as np\n",
"n = 1000000\n",
"a = np.random.randn(n)\n",
"mkl.set_num_threads(1)\n",
"%timeit a + a\n",
"mkl.set_num_threads(2)\n",
"%timeit a + a"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100 loops, best of 3: 10.4 ms per loop\n",
"100 loops, best of 3: 5.73 ms per loop\n"
]
}
],
"source": [
"## This demo requires Anaconda distribution to be installed\n",
"import mkl\n",
"n = 500\n",
"a = np.random.randn(n, n)\n",
"mkl.set_num_threads(1)\n",
"%timeit a.dot(a)\n",
"mkl.set_num_threads(4)\n",
"%timeit a.dot(a)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Typically, two cases are distinguished: \n",
"1. Shared memory (e.g., the multicore CPUs in every desktop/smartphone)\n",
"2. Distributed memory (i.e., each processor has its own memory and can send information through a network)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"In both cases, the efficiency is governed by the \n",
"\n",
"**memory bandwidth**: \n",
"\n",
"for BLAS-1/BLAS-2 routines (like the sum of two vectors) reads/writes take all the time, \n",
"\n",
"while for BLAS-3 routines a much more noticeable speedup can be obtained. \n",
"\n",
"For large-scale clusters (>100 000 cores, see the [Top500 list](http://www.top500.org/lists/)) scaling is still possible."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Communication-avoiding algorithms\n",
"A new direction in NLA is **communication-avoiding** algorithms (e.g., in Hadoop-style settings), when you have many computing nodes but very slow communication with limited capabilities. \n",
"\n",
"This requires **absolutely different algorithms**.\n",
"\n",
"This can be an interesting **project** (e.g., do NLA in a cloud)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Summary of the MM part\n",
"- MM is the core of NLA. You have to think in block terms if you want high efficiency\n",
"- It is all about the computer memory hierarchy\n",
"- The $\\mathcal{O}(n^{2 + \\epsilon})$ complexity hypothesis has been neither proven nor disproven yet.\n",
"\n",
"Now we go to [matrix norms](lecture-3.2.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"./styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}