{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis and Machine Learning Applications for Physicists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).\n", "\n", "[Table of contents](Contents.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Jupyter Notebooks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter is an interactive front-end to a rich ecosystem of python packages that support machine learning and data science tasks. We will use the core set of packages shown below for this course:\n", "![MLS packages](img/JupyterNumpy/packages.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All course notebooks begin with the following boilerplate to import and configure some frequently used packages:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns; sns.set()\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might see a warning about \"building the font cache\" if this is your first time using the course conda environment, but any other message probably indicates that you do not have your environment set up correctly yet." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the actual package versions we are using:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "matplotlib 2.2.3\n", "seaborn 0.9.0\n", "numpy 1.14.5\n", "pandas 0.23.4\n" ] } ], "source": [ "import matplotlib\n", "print('matplotlib', matplotlib.__version__)\n", "print('seaborn', sns.__version__)\n", "print('numpy', np.__version__)\n", "print('pandas', pd.__version__)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will get plenty of experience with notebooks during this course so we will start with just a few basic survival skills. For a deeper dive, start with [IPython: Beyond Normal Python](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html) in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Type `command?` to display help on `command`. Try this now for the built-in `range` function, then dismiss the help window." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hit the tab key to display possible completions of a partial name. Try this now by typing `np.arc` below, then hitting TAB." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your notebook will sometimes be unresponsive while executing a long command that you don't want to wait for. Use the *Interrupt* command in the *Kernel* menu to recover from situations like this. Try this out now:\n", " - Uncomment the `time.sleep(3600)` line below, which sleeps for an hour.\n", " - Execute the cell and note the `In [*]` to the left while it is running.\n", " - Interrupt the kernel."
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import time\n", "#time.sleep(3600)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In more extreme cases you may need to use the *Restart* command, which puts your kernel in a well-defined initial state (good) but also forgets any variables or functions you have already defined (bad). Try this now:\n", " - Re-run the sleep cell above.\n", " - Restart the kernel. Note that this leaves `[*]` next to the cell, which is misleading.\n", " - Comment out the `time.sleep(3600)` line above.\n", " - Rebuild the kernel state using *Run All Above* from the *Cell* menu. This fixes the `[*]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Numerical Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section introduces the \"numpy\" package. For further reading, see [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) in the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Rationale" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The problem that numpy addresses is that python lists are very flexible, but not optimized for the special case where all list elements are numeric values of the same type, which can be very efficiently organized in memory. Numpy provides a specialized array type for this case, with lots of nice features. One downside of this approach is that most of the builtin math functions are duplicated (e.g., `math.sin` and `np.sin`) so that they work with numpy arrays."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we fill a list with values $\\sin(\\pi k / n)$ using plain python and the builtin math package (if the use of spaces in `math.pi * k / n` looks odd, see [PEP8](https://www.python.org/dev/peps/pep-0008/#whitespace-in-expressions-and-statements)):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import math" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def create_python_array(n=100):\n", "    x = []\n", "    for k in range(n):\n", "        x.append(math.sin(math.pi * k / n))\n", "    return x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use an ipython [magic command](http://ipython.readthedocs.io/en/stable/interactive/magics.html) to time this function for a large array:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 262 ms, sys: 14.1 ms, total: 276 ms\n", "Wall time: 274 ms\n" ] } ], "source": [ "%time x1 = create_python_array(1000000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting list is very flexible so, for example, we can replace any value with a completely different type or append new values:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "x1[3] = 'not a number'\n", "x1.append(-1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The equivalent code using numpy functions does not require any python loop (which is often a bottleneck):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def create_numpy_array(n=100):\n", "    k = np.arange(n)\n", "    x = np.sin(np.pi * k / n)\n", "    return x" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 20.5 ms, 
sys: 9.14 ms, total: 29.6 ms\n", "Wall time: 27.9 ms\n" ] } ], "source": [ "%time x2 = create_numpy_array(1000000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting numpy array holds only 64-bit floating point numbers and has a fixed size:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "could not convert string to float: 'not a number'\n" ] } ], "source": [ "try:\n", " x2[2] = 'not a number'\n", "except ValueError as e:\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most common ways you will create arrays are:\n", " - Filled with a regular sequence of values.\n", " - Filled with random values.\n", " - Calculated as a mathematical function of other arrays.\n", " \n", "The exercises below will give you some practice, with pointers to the `numpy functions` to use (Use the builtin notebook help to learn more about them). If you do not see a green \"Show Solution\" button below each exercise, check that you have installed and enabled the [exercise2 notebook extension](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/exercise2)." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.arange` to create an array of integers 0, 1, ..., 9:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(10)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." 
] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.linspace` to create an array of 11 floats 0.0, 0.1, ..., 0.9, 1.0:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.linspace(0., 1., 11)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.random.RandomState` to initialize a reproducible (pseudo)random number generator using the seed 123:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "generator = np.random.RandomState(seed=123)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.random.uniform` to generate an array of 10 uniformly distributed random numbers in the interval [0,1]:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([0.69646919, 0.28613933, 0.22685145, 0.55131477, 0.71946897,\n", " 0.42310646, 0.9807642 , 0.68482974, 0.4809319 , 0.39211752])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "generator.uniform(size=10)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." 
] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.random.normal` to generate an array of 10 random numbers drawn from a [normal (aka Gaussian) distribution](https://en.wikipedia.org/wiki/Normal_distribution) with mean -1 and [standard deviation](https://en.wikipedia.org/wiki/Standard_deviation) 2:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([ 1.53187252, -2.7334808 , -2.3577723 , -1.18941794, 1.98277925,\n", " -2.27780399, -1.88796392, -1.86870255, 3.41186017, 3.37357218])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "generator.normal(loc=-1, scale=2, size=10)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Generate arrays x, y, z of length 10 with values drawn from a \"unit\" Gaussian distribution (mean=0, std=1):" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "x = generator.normal(size=10)\n", "y = generator.normal(size=10)\n", "z = generator.normal(size=10)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Adjust the values in x, y, z so that the vectors (x,y,z) are normalized. (This is a useful trick for generating random points that are uniformly distributed on a unit sphere)." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "r = np.sqrt(x ** 2 + y ** 2 + z ** 2)\n", "x /= r\n", "y /= r\n", "z /= r" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Calculate arrays `theta` and `phi` containing the polar and azimuthal angles **in degrees** for each vector (x, y, z). Use `np.arccos`, `np.arctan2`, `np.degrees`." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "solution2": "hidden" }, "outputs": [], "source": [ "theta = np.degrees(np.arccos(z))\n", "phi = np.degrees(np.arctan2(y, x))" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multidimensional Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The examples above all create 1D arrays, but numpy arrays can have multiple dimensions, with the size along each dimension given by the array's \"shape\":" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "vector = np.ones(shape=(4,))\n", "matrix = np.ones(shape=(3, 4))\n", "tensor = np.ones(shape=(2, 3, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Access array elements with zero-based indexing:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1.0" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector[1] # single element" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1., 1., 1., 1.])" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix[1, :] # single row" ] }, { "cell_type": "code", 
"execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1., 1., 1.])" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix[:, 1] # single column" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1., 1.],\n", " [1., 1.]])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tensor[0, :2, :2] # submatrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `np.dot` to calculate inner products:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4.0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector.dot(vector)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4., 4., 4.])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix.dot(vector)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays of the same shape can be combined in \"vectorized\" operations (i.e., without any explicit loops):" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1.84147098, 1.84147098, 1.84147098, 1.84147098],\n", " [1.84147098, 1.84147098, 1.84147098, 1.84147098],\n", " [1.84147098, 1.84147098, 1.84147098, 1.84147098]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix + np.sin(matrix ** 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays of different shapes can also be combined when they are compatible according to the [broadcasting rules](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html). 
Note that the broadcasting rules do not depend on the operations being used so, for example, if `x + y` is valid, then so is `x * y ** 2 + x ** 2 * y`." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1.1, 1.1, 1.1, 1.1],\n", "       [1.1, 1.1, 1.1, 1.1],\n", "       [1.1, 1.1, 1.1, 1.1]])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix + 0.1" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[2., 2., 2., 2.],\n", "       [2., 2., 2., 2.],\n", "       [2., 2., 2., 2.]])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vector + matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a function to try broadcasting two arbitrary shapes:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def broadcast(shape1, shape2):\n", "    array1 = np.ones(shape1)\n", "    array2 = np.ones(shape2)\n", "    try:\n", "        array12 = array1 + array2\n", "        print('shapes {} {} broadcast to {}'.format(shape1, shape2, array12.shape))\n", "    except ValueError as e:\n", "        print(e)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shapes (1, 3) (3,) broadcast to (1, 3)\n" ] } ], "source": [ "broadcast((1, 3), (3,))" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "operands could not be broadcast together with shapes (1,2) (3,) \n" ] } ], "source": [ "broadcast((1, 2), (3,))" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Predict the results of the following:\n", "```\n", "broadcast((3, 1, 2), (3, 2))\n", "broadcast((2, 1, 3), (3, 2))\n", "broadcast((3,), (2, 1))\n", "broadcast((3,), (1, 
2))\n", "broadcast((3,), (1, 3))\n", "```" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shapes (3, 1, 2) (3, 2) broadcast to (3, 3, 2)\n", "operands could not be broadcast together with shapes (2,1,3) (3,2) \n", "shapes (3,) (2, 1) broadcast to (2, 3)\n", "operands could not be broadcast together with shapes (3,) (1,2) \n", "shapes (3,) (1, 3) broadcast to (1, 3)\n" ] } ], "source": [ "broadcast((3, 1, 2), (3, 2))\n", "broadcast((2, 1, 3), (3, 2))\n", "broadcast((3,), (2, 1))\n", "broadcast((3,), (1, 2))\n", "broadcast((3,), (1, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since array values are always \"unrolled\" into physical memory, arrays with different shapes can look identical in memory. Numpy takes advantage of this with \"views\" that allow the same memory to be very efficiently interpreted with different shapes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `np.newaxis` constant (an alias for `None`) inserts a new axis of length 1 when used as an index. It is useful for increasing the dimensionality of an array, especially in combination with numpy's broadcasting."
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4,)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(4).shape" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 4)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# make it a row vector by inserting an axis along the first dimension\n", "np.arange(4)[np.newaxis, :].shape" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 1)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# make it a column vector by inserting an axis along the second dimension\n", "np.arange(4)[:, np.newaxis].shape" ] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.reshape` to create a view of `np.arange(12)` with shape (2, 1, 6):" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([[[ 0,  1,  2,  3,  4,  5]],\n", "\n", "       [[ 6,  7,  8,  9, 10, 11]]])" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(12).reshape(2, 1, 6)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..."
] }, { "cell_type": "markdown", "metadata": { "solution2": "hidden", "solution2_first": true }, "source": [ "**EXERCISE:** Use `np.newaxis` to create a view of `np.arange(12)` with shape (1, 12):" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "solution2": "hidden" }, "outputs": [ { "data": { "text/plain": [ "array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(12)[np.newaxis, :]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "# Add your solution here..." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }