{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial: NumPy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "__author__ = \"Christopher Potts, Will Monroe, and Lucy Li\"\n", "__version__ = \"CS224u, Stanford, Spring 2020\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Contents\n", "\n", "1. [Motivation](#Motivation)\n", "1. [Vectors](#Vectors)\n", " 1. [Vector Initialization](#Vector-Initialization)\n", " 1. [Vector indexing](#Vector-indexing)\n", " 1. [Vector assignment](#Vector-assignment)\n", " 1. [Vectorized operations](#Vectorized-operations)\n", " 1. [Comparison with Python lists](#Comparison-with-Python-lists)\n", "1. [Matrices](#Matrices)\n", " 1. [Matrix initialization](#Matrix-initialization)\n", " 1. [Matrix indexing](#Matrix-indexing)\n", " 1. [Matrix assignment](#Matrix-assignment)\n", " 1. [Matrix reshaping](#Matrix-reshaping)\n", " 1. [Numeric operations](#Numeric-operations)\n", "1. [Practical example: a shallow neural network](#Practical-example:-a-shallow-neural-network)\n", "1. [Going beyond NumPy alone](#Going-beyond-NumPy-alone)\n", " 1. [Pandas](#Pandas)\n", " 1. [Scikit-learn](#Scikit-learn)\n", " 1. [SciPy](#SciPy)\n", " 1. [Matplotlib](#Matplotlib)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation\n", "\n", "Why should we care about NumPy? \n", "\n", "- It allows you to perform tons of operations on vectors and matrices. \n", "- It replaces naive for-loop implementations with much faster array-level operations (a.k.a. vectorization). \n", "- We use it in our class (see files prefixed with `np_` in your `cs224u` directory). \n", "- It's used a ton in machine learning / AI. \n", "- Its arrays are often inputs into other important Python packages' functions. \n", "\n", "In Jupyter notebooks, NumPy documentation is two clicks away: Help -> NumPy reference. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vector Initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.zeros(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.ones(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert list to numpy array\n", "np.array([1,2,3,4,5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# convert numpy array to list\n", "np.ones(5).tolist()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# one float => all floats\n", "np.array([1.0,2,3,4,5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same as above\n", "np.array([1,2,3,4,5], dtype='float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# spaced values in interval\n", "np.array([x for x in range(20) if x % 2 == 0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same as above\n", "np.arange(0,20,2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# random floats in [0, 1)\n", "np.random.random(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# random integers\n", "np.random.randint(5, 15, size=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vector indexing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array([10,20,30,40,50])" ] }, { "cell_type": "code", "execution_count": null, 
"metadata": {}, "outputs": [], "source": [ "x[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# slice\n", "x[0:2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x[0:1000]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# last value\n", "x[-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# last value as array\n", "x[[-1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# last 3 values\n", "x[-3:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pick indices\n", "x[[0,2,4]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vector assignment\n", "\n", "Be careful when assigning arrays to new variables! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#x2 = x # try this line instead\n", "x2 = x.copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x2[0] = 10\n", "\n", "x2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x2[[1,2]] = 10\n", "\n", "x2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x2[[3,4]] = [0, 1]\n", "\n", "x2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if the original vector changed\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Vectorized operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x.sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], 
"source": [ "x.max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x.argmax()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.log(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.exp(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x + x # Try also with *, -, /, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x + 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison with Python lists\n", "\n", "Vectorizing your mathematical expressions can lead to __huge__ performance gains. The following example is meant to give you a sense for this. It compares applying `np.log` to each element of a list with 10 million values with the same operation done on a vector." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# log every value as list, one by one\n", "def listlog(vals):\n", " return [np.log(y) for y in vals]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get random vector\n", "samp = np.random.random_sample(int(1e7))+1\n", "samp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time _ = np.log(samp)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%time _ = listlog(samp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matrices\n", "\n", "The matrix is the core object of machine learning implementations. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matrix initialization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array([[1,2,3], [4,5,6]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.array([[1,2,3], [4,5,6]], dtype='float')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.zeros((3,5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.ones((3,5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.identity(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.diag([1,2,3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matrix indexing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = np.array([[1,2,3], [4,5,6]])\n", "X" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X[0,0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get row\n", "X[0, : ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get column\n", "X[ : , 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get multiple columns\n", "X[ : , [0,2]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matrix assignment" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# X2 = X # try this line instead\n", "X2 = X.copy()\n", "\n", "X2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X2[0,0] = 20\n", "\n", 
"X2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X2[0] = 3\n", "\n", "X2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X2[: , -1] = [5, 6]\n", "\n", "X2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check if original matrix changed\n", "X" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matrix reshaping" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z = np.arange(1, 7)\n", "\n", "z" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "z.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Z = z.reshape(2,3)\n", "\n", "Z" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Z.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Z.reshape(6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# same as above\n", "Z.flatten()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# transpose\n", "Z.T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numeric operations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A = np.array(range(1,7), dtype='float').reshape(2,3)\n", "\n", "A" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "B = np.array([1, 2, 3])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# not the same as A.dot(B)\n", "A * B" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A + B" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "A / B" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# matrix multiplication\n", "A.dot(B)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "B.dot(A.T)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "A.dot(A.T)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# outer product\n", "# multiplying each element of first vector by each element of the second\n", "np.outer(B, B)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practical example: a shallow neural network" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is a practical example of numerical operations on NumPy matrices. \n", "\n", "In our class, we have a shallow neural network implemented in `np_shallow_neural_network.py`. See how the forward and backward passes use no for loops, and instead takes advantage of NumPy's ability to vectorize manipulations of data. \n", "\n", "```python\n", "def forward_propagation(self, x):\n", " h = self.hidden_activation(x.dot(self.W_xh) + self.b_xh)\n", " y = softmax(h.dot(self.W_hy) + self.b_hy)\n", " return h, y\n", "\n", "def backward_propagation(self, h, predictions, x, labels):\n", " y_err = predictions.copy()\n", " y_err[np.argmax(labels)] -= 1\n", " d_b_hy = y_err\n", " h_err = y_err.dot(self.W_hy.T) * self.d_hidden_activation(h)\n", " d_W_hy = np.outer(h, y_err)\n", " d_W_xh = np.outer(x, h_err)\n", " d_b_xh = h_err\n", " return d_W_hy, d_b_hy, d_W_xh, d_b_xh\n", "```\n", "\n", "The forward pass essentially computes the following: \n", " $$h = f(xW_{xh} + b_{xh})$$\n", " $$y = \\text{softmax}(hW_{hy} + b_{hy}),$$\n", "where $f$ is `self.hidden_activation`. \n", "\n", "The backward pass propagates error by computing local gradients and chaining them. 
Feel free to learn more about backprop [here](http://cs231n.github.io/optimization-2/), though it is not necessary for our class. Also look at this [neural networks case study](http://cs231n.github.io/neural-networks-case-study/) to see another example of how NumPy can be used to implement forward and backward passes of a simple neural network. " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Going beyond NumPy alone\n", "\n", "These are examples of how NumPy can be used with other Python packages. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pandas\n", "We can convert NumPy matrices to Pandas dataframes. In the following example, this is useful because it allows us to label each row. You may have noticed this being done in our first unit on distributed representations. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "count_df = pd.DataFrame(\n", " np.array([\n", " [1,0,1,0,0,0],\n", " [0,1,0,1,0,0],\n", " [1,1,1,1,0,0],\n", " [0,0,0,0,1,1],\n", " [0,0,0,0,0,1]], dtype='float64'),\n", " index=['gnarly', 'wicked', 'awesome', 'lame', 'terrible'])\n", "count_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scikit-learn\n", "\n", "In `sklearn`, NumPy matrices are the most common input and output and thus a key to how the library's numerous methods can work together. Many of the cs224u models built by Chris operate just like `sklearn` ones, such as the classifiers we used for our sentiment analysis unit. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import classification_report\n", "from sklearn import datasets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris = datasets.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "print(type(X))\n", "print(\"Dimensions of X:\", X.shape)\n", "print(type(y))\n", "print(\"Dimensions of y:\", y.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# split data into train/test\n", "X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(\n", " X, y, train_size=0.7, test_size=0.3)\n", "print(\"X_iris_train:\", type(X_iris_train))\n", "print(\"y_iris_train:\", type(y_iris_train))\n", "print()\n", "\n", "# start up model\n", "maxent = LogisticRegression(\n", " fit_intercept=True, \n", " solver='liblinear', \n", " multi_class='auto')\n", "\n", "# train on train set\n", "maxent.fit(X_iris_train, y_iris_train)\n", "\n", "# predict on test set\n", "iris_predictions = maxent.predict(X_iris_test)\n", "fnames_iris = iris['feature_names']\n", "tnames_iris = iris['target_names']\n", "\n", "# how well did our model do?\n", "print(classification_report(y_iris_test, iris_predictions, target_names=tnames_iris))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SciPy\n", "\n", "SciPy contains what may seem like an endless treasure trove of operations for linear algebra, optimization, and more. It is built so that everything can work with NumPy arrays. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.spatial.distance import cosine\n", "from scipy.stats import pearsonr\n", "from scipy import linalg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# cosine distance\n", "a = np.random.random(10)\n", "b = np.random.random(10)\n", "cosine(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# pearson correlation (coeff, p-value)\n", "pearsonr(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# inverse of matrix\n", "A = np.array([[1,3,5],[2,5,1],[2,3,8]])\n", "linalg.inv(A)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To learn more about how NumPy can be combined with SciPy and Scikit-learn for machine learning, check out this [notebook tutorial](https://github.com/cgpotts/csli-summer/blob/master/advanced_python/intro_to_python_ml.ipynb) by Chris Potts and Will Monroe. (You may notice that over half of this current notebook is modified from theirs.) Their tutorial also has some interesting exercises in it! 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.sort(np.random.random(30))\n", "b = a**2\n", "c = np.log(a)\n", "plt.plot(a, b, label='y = x^2')\n", "plt.plot(a, c, label='y = log(x)')\n", "plt.legend()\n", "plt.title(\"Some functions\")\n", "plt.show()" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }