{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ISL Lab 2.3 Introduction to Statistical Computing with Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import scipy\n", "import scipy.stats\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter('ignore',FutureWarning)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3.1 Basic Commands" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.arange(6)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.arange(6)\n", "b = a.reshape((2,3))\n", "b" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rnorm = scipy.stats.norm(loc=0,scale=1) # mean = loc = 0, standard_deviation = scale = 1\n", "x = rnorm.rvs(size=50)\n", "\n", "err = scipy.stats.norm(loc=50, scale=0.1)\n", "y = err.rvs(size=50)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.corrcoef(x,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(1303)\n", "rnorm.rvs(size=8)\n", "# Notice - same random numbers all of the time" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.random.seed(3)\n", "y = rnorm.rvs(size=100)\n", "np.mean(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.var(y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(np.var(y))" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.std(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3.2 Graphics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = rnorm.rvs(size=100)\n", "y = rnorm.rvs(size=100)\n", "ax = sns.scatterplot(x=x, y=y);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have to dig back into Matplotlib to set axis labels, so all is not perfect." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = sns.scatterplot(x=x, y=y);\n", "ax.set(xlabel=\"the x-axis\",ylabel=\"the y-axis\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding a title is a little more annoying, per [Stack Overflow explanation of adding a title to a Seaborn plot](https://stackoverflow.com/questions/42406233/how-to-add-title-to-seaborn-boxplot). There are [more complex explanations](https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot#29814281) that work with multiple subplots." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = sns.scatterplot(x=x, y=y);\n", "ax.set_xlabel('independent var')\n", "ax.set_ylabel('dependent var')\n", "ax.set_title('Massive Title')\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Saving an image to a file](https://stackoverflow.com/questions/9622163/save-plot-to-image-file-instead-of-displaying-it-using-matplotlib#9890599) is also pretty straightforward using [savefig](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.savefig) from PyPlot." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = sns.scatterplot(x=x, y=y);\n", "ax.set_title('Save this plot')\n", "plt.savefig('unlabeled-axes.png');\n", "# ugliness to avoid showing figure:\n", "fig = plt.gcf()\n", "plt.close(fig)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`np.linspace` returns evenly spaced values between the start and end, both included." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-np.pi,np.pi,50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A contour plot needs a 2D array of z values (x,y) -> f(x,y).\n", "\n", "The hard part is building the inputs to the function: called on the 1D arrays x and y, f would pair them up elementwise (in parallel) instead of evaluating every (x,y) combination, so `np.meshgrid` is used to build the 2D coordinate arrays first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-np.pi,np.pi,50)\n", "y = x # for clarity only\n", "xx,yy = np.meshgrid(x,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def fbasic(x,y): return np.cos(y) / (1+x**2)\n", "f = np.vectorize(lambda x,y: np.cos(y) / (1+x**2))\n", "z = f(xx,yy)\n", "plt.contour(z);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.contour(z,45);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def g1(x,y): return (fbasic(x,y)+fbasic(y,x))/2\n", "g = np.vectorize(g1)\n", "z2 = g(xx,yy)\n", "plt.contour(z2,15);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`imshow` shows an image, like the `R` command `image`. \n", "Surely there is a way to get the coordinates input as well as the `z`, but in practice a regular grid seems most likely." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "randompix = np.random.random((16, 16))\n", "plt.imshow(randompix);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3D Rendering" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mpl_toolkits.mplot3d import Axes3D\n", "from matplotlib import cm\n", "from matplotlib.ticker import LinearLocator, FormatStrFormatter\n", "\n", "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection='3d')\n", "\n", "surf = ax.plot_surface(xx,yy,z, cmap=cm.coolwarm);\n", "plt.show();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3.3 Indexing Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = np.arange(1,17).reshape((4,4)).T # matches R example" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(a)\n", "a[1,2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beware if following code in the book. R indices start at 1, while Python indices start at 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[[0,2],[1,3]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[[0,2],:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[:,[1,3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you combine the two in one set of brackets, they are traversed in parallel, getting you a[0,1] and a[2,3]." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[[0,2],[1,3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you want a sub-array, index twice." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[[0,2],:][:,[1,3]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `ix_` function makes grids out of indices that you give it. Clearer for this!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[np.ix_([0,2],[1,3])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: R ranges include the last item, Python ranges do not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[np.ix_(np.arange(0,3),np.arange(1,4))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[[0,1],]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[:,[0,1]]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a[1,]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dropping rows or columns is not as convenient in Python." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "b = np.delete(a,[0,2],0)\n", "b" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "c = np.delete(b,[0,2,3],1)\n", "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3.4 Loading Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: To get the data from a preloaded R dataset, I do `write.table(the_data, file=\"whatever\", sep=\"\\t\")` in R.\n", "\n", "Cool fact: `read_table` (like `read_csv`) can load straight from a URL." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#auto = pd.read_table(\"Auto.data\")\n", "# Assumes missing values are coded as \"?\" in this file, as in the book's Auto data\n", "auto = pd.read_csv(\"http://www-bcf.usc.edu/~gareth/ISL/Auto.csv\", na_values=\"?\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get rid of any rows with missing data. This is *not* always a good idea." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto = auto.dropna()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.3.5 Graphical and Numerical Summaries " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.scatterplot(x=auto['cylinders'], y=auto['mpg']);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.boxplot(x=\"cylinders\", y=\"mpg\", data=auto);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.stripplot(x=\"cylinders\", y=\"mpg\", data=auto);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.distplot(auto['mpg']);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.distplot(auto['mpg'],bins=15, kde=False, vertical=True);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(data=auto);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(data=auto[['mpg','displacement','horsepower',\n", " 'weight','acceleration']]);" ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "I am not aware of a way to interactively identify points on a matplotlib plot that is similar to the R command `identify`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto['name'].value_counts().head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto['mpg'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Miscellaneous Notes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorical data can be constructed using `astype('category')` in Pandas. Read [more about categorical data](https://pandas.pydata.org/pandas-docs/stable/categorical.html) if you need the information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "auto['cylinders'] = auto['cylinders'].astype('category')\n", "auto['cylinders'].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Homework Starter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Easy access to [ISL datasets](http://www-bcf.usc.edu/~gareth/ISL/data.html) if you have internet access." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "college = pd.read_csv(\"http://www-bcf.usc.edu/~gareth/ISL/College.csv\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }