{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ISL Lab 2.3 Introduction to Statistical Computing with Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scipy\n",
    "import scipy.stats\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter('ignore',FutureWarning)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.1 Basic Commands"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.arange(6)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = np.arange(6)\n",
    "b = a.reshape((2,3))\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sqrt(b)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rnorm = scipy.stats.norm(loc=0,scale=1)  # mean =  loc = 0, standard_deviation = scale = 1\n",
    "x = rnorm.rvs(size=50)\n",
    "\n",
    "err = scipy.stats.norm(loc=50, scale=0.1)\n",
    "y = err.rvs(size=50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.corrcoef(x,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(1303)\n",
    "rnorm.rvs(size=8)\n",
    "# Notice - same random numbers all of the time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.random.seed(3)\n",
    "y = rnorm.rvs(size=100)\n",
    "np.mean(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.var(y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.sqrt(np.var(y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.std(y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.2 Graphics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = rnorm.rvs(size=100)\n",
    "y = rnorm.rvs(size=100)\n",
    "ax = sns.scatterplot(x,y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Have to dig back into MatPlotLib to set axis labels, so all is not perfect."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x,y);\n",
    "ax.set(xlabel=\"the x-axis\",ylabel=\"the y-axis\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adding a title is a little more annoying, per [Stack Overflow explanation of adding a title to a Seaborn plot](https://stackoverflow.com/questions/42406233/how-to-add-title-to-seaborn-boxplot). There are [more complex explanations](https://stackoverflow.com/questions/29813694/how-to-add-a-title-to-seaborn-facet-plot#29814281) that work with multiple subplots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x,y);\n",
    "ax.set_xlabel('independent var')\n",
    "ax.set_ylabel('dependent var')\n",
    "ax.set_title('Massive Title')\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Saving an image to a file](https://stackoverflow.com/questions/9622163/save-plot-to-image-file-instead-of-displaying-it-using-matplotlib#9890599) is also pretty straightforward using [savefig](https://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.savefig) from PyPlot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = sns.scatterplot(x,y);\n",
    "ax.set_title('Save this plot')\n",
    "plt.savefig('unlabeled-axes.png');\n",
    "# ugliness to avoid showing figure:\n",
    "fig = plt.gcf()\n",
    "plt.close(fig)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`np.linspace` makes equally spaced steps between the start and end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(-np.pi,np.pi,50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A contour plot needs a 2D array of z values (x,y) -> f(x,y).\n",
    "\n",
    "The hard part is getting the inputs to the function, or convincing f not to vectorize over x,y in parallel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(-np.pi,np.pi,50)\n",
    "y = x # for clarity only\n",
    "xx,yy = np.meshgrid(x,y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def fbasic(x,y): return np.cos(y) / (1+x**2)\n",
    "f = np.vectorize(lambda x,y: np.cos(y) / (1+x**2))\n",
    "z = f(xx,yy)\n",
    "plt.contour(z);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.contour(z,45);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def g1(x,y): return (fbasic(x,y)+fbasic(y,x))/2\n",
    "g = np.vectorize(g1)\n",
    "z2 = g(xx,yy)\n",
    "plt.contour(z2,15);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`imshow` shows an image, like the `R` command `image`. \n",
    "Surely there is a way to get the coordinates input as well as the `z`, but in practice a regular grid seems most likely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "randompix = np.random.random((16, 16))\n",
    "plt.imshow(randompix);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3D Rendering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from mpl_toolkits.mplot3d import Axes3D\n",
    "from matplotlib import cm\n",
    "from matplotlib.ticker import LinearLocator, FormatStrFormatter\n",
    "\n",
    "fig = plt.figure()\n",
    "ax = fig.add_subplot(111, projection='3d')\n",
    "\n",
    "surf = ax.plot_surface(xx,yy,z, cmap=cm.coolwarm);\n",
    "plt.show();"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.3 Indexing Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = np.arange(1,17).reshape((4,4)).T   # matches R example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(a)\n",
    "a[1,2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Beware if following code in the book. R indices start at 1, while Python indices start at 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[[0,2],[1,3]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[[0,2],:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[:,[1,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you combine the two in one set of brackets, they are traversed in parallel, getting you a[0,1] and a[2,3]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[[0,2],[1,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you want a sub-array, index twice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[[0,2],:][:,[1,3]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `ix_` function makes grids out of indices that you give it. Clearer for this!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[np.ix_([0,2],[1,3])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: R ranges include the last item, Python ranges do not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[np.ix_(np.arange(0,3),np.arange(1,4))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[[0,1],]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[:,[0,1]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a[1,]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dropping columns is not as convenient in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "b = np.delete(a,[0,2],0)\n",
    "b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c = np.delete(b,[0,2,3],1)\n",
    "c"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.4 Loading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: To get the data from a preloaded R dataset, I do `write_table(the_data, filename=\"whatever\", sep=\"\\t\")` in R.\n",
    "\n",
    "Cool fact: `read_table` can load straight from a URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#auto = pd.read_table(\"Auto.data\")\n",
    "auto = pd.read_csv(\"http://www-bcf.usc.edu/~gareth/ISL/Auto.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get rid of any rows with missing data. This is *not* always a good idea."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto = auto.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  2.3.5 Graphical and Numerical Summaries "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(auto['cylinders'], auto['mpg']);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(x=\"cylinders\", y=\"mpg\", data=auto);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.stripplot(x=\"cylinders\", y=\"mpg\", data=auto);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.distplot(auto['mpg']);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.distplot(auto['mpg'],bins=15, kde=False, vertical=True);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(data=auto);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(data=auto[['mpg','displacement','horsepower',\n",
    "                        'weight','acceleration']]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I am not aware of a way to interactively identify points on a matplotlib plot that is similar to the R command `identify`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto['name'].value_counts().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto['mpg'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Miscellaneous Notes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categorical data can be constructed using `astype('category')` in Pandas. Read [more about categorical data](https://pandas.pydata.org/pandas-docs/stable/categorical.html) if you need the information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "auto['cylinders'] = auto['cylinders'].astype('category')\n",
    "auto['cylinders'].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Homework Starter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Easy access to [ISL datasets](http://www-bcf.usc.edu/~gareth/ISL/data.html) if you have internet access."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "college = pd.read_csv(\"http://www-bcf.usc.edu/~gareth/ISL/College.csv\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}