{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lab 7: Regression\n",
    "\n",
    "Welcome to Lab 7!\n",
    "\n",
    "Today we will get some hands-on practice with linear regression. You can find more information about this topic in\n",
    "[section 15.2](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html#the-regression-line)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Run this cell, but please don't change it.\n",
    "\n",
    "# These lines import the Numpy and Datascience modules.\n",
    "import numpy as np\n",
    "from datascience import *\n",
    "\n",
    "# These lines do some fancy plotting magic.\n",
    "import matplotlib\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plots\n",
    "plots.style.use('fivethirtyeight')\n",
    "import warnings\n",
    "warnings.simplefilter('ignore', FutureWarning)\n",
    "\n",
    "from scipy.stats import pearsonr as cor\n",
    "\n",
    "import scipy.stats as stats\n",
    "\n",
    "import scipy.stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. How Faithful is Old Faithful? \n",
    "\n",
    "Old Faithful is a geyser in Yellowstone National Park that is famous for eruption on a fairly regular schedule. Run the cell below to see Old Faithful in action!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For the curious: this is how to display a YouTube video in a\n",
    "# Jupyter notebook.  The argument to YouTubeVideo is the part\n",
    "# of the URL (called a \"query parameter\") that identifies the\n",
    "# video.  For example, the full URL for this video is:\n",
    "#   https://www.youtube.com/watch?v=wE8NDuzt8eg\n",
    "from IPython.display import YouTubeVideo\n",
    "YouTubeVideo(\"wE8NDuzt8eg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of Old Faithful's eruptions last longer than others.  Whenever there is a long eruption, it is usually followed by an even longer wait before the next eruption. If you visit Yellowstone, you might want to predict when the next eruption will happen, so that you can see the rest of the park instead of waiting by the geyser.\n",
    " \n",
    "Today, we will use a dataset on eruption durations and waiting times to see if we can make such predictions accurately with linear regression.\n",
    "\n",
    "The dataset has one row for each observed eruption.  It includes the following columns:\n",
    "- `duration`: Eruption duration, in minutes\n",
    "- `wait`: Time between this eruption and the next, also in minutes\n",
    "\n",
    "Run the next cell to load the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful = Table.read_table(\"faithful.csv\")\n",
    "faithful"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We would like to use linear regression to make predictions, but that won't work well if the data aren't roughly linearly related.  To check that, we should look at the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "**Question 1.1.** Make a scatter plot of the data.  It's conventional to put the column we want to predict on the vertical axis and the other column on the horizontal axis.  Add fit_line<span style =\"color:purple\">=</span> <span style=\"color:green; font-family:monospace; font-weight: bold\">  True </span> to make the regression line appear on the plot.   \n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q1_1\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 1.2.** Are eruption duration and waiting time roughly linearly related based on the scatter plot above? Is this relationship positive?\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q1_2\n",
    "-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Write your answer here, replacing this text.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to continue with the assumption that they are linearly related, so it's reasonable to use linear regression to analyze this data.\n",
    "\n",
    "We'd next like to plot the data in standard units. If you don't remember the definition of standard units, textbook section [14.2](https://www.inferentialthinking.com/chapters/14/2/Variability.html#standard-units) might help!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "**Question 1.3.** Use the means and standard deviations provided and convert both the eruption durations and waiting times into standard units.  **Then** create a table called `faithful_standard` containing the eruption durations and waiting times in standard units.  The columns should be named `duration (standard units)` and `wait (standard units)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$SU = \\frac{x - \\overline{x}}{S_x}$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "for_assignment_type": "solution"
   },
   "outputs": [],
   "source": [
    "import scipy.stats as stats\n",
    "\n",
    "duration_mean = np.mean(faithful.column(\"duration\"))\n",
    "duration_std = stats.tstd(faithful.column(\"duration\"))\n",
    "wait_mean = np.mean(faithful.column(\"wait\"))\n",
    "wait_std = stats.tstd(faithful.column(\"wait\"))\n",
    "\n",
    "duration_in_su = ...\n",
    "wait_in_su = ...\n",
    "\n",
    "faithful_standard = Table().with_columns(\"duration (standard units)\",\n",
    "                                        ...,\n",
    "                                        \"wait (standard units)\",\n",
    "                                        ...)\n",
    "\n",
    "faithful_standard"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 1.4.** Plot the data again, but this time in standard units.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q1_4\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You'll notice that this plot looks the same as the last one!  However, the data and axes are scaled differently.  So it's important to read the ticks on the axes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "Recall that $$r = \\frac{1}{n-1} \\sum_{i=1}^n\\left(\\frac{x_i-\\overline{x}}{S_x}\\right)\\left(\\frac{y_i-\\overline{y}}{S_y}\\right)$$\n",
    "\n",
    "So **if** x and y are in standard unit, then $$r = \\frac{1}{n-1} \\sum_{i=1}^nx_iy_i$$\n",
    "\n",
    "**Question 1.5.** Use this fact to compute the correlation `r` from the faithful_standard table.    \n",
    "\n",
    "*Hint:* Use `faithful_standard`.  Section [15.1](https://www.inferentialthinking.com/chapters/15/1/Correlation.html#calculating-r) explains how to do this.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q1_5\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "correlation = ...\n",
    "correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "**Question 1.6.** Use the function `cor` (which was imported from the `scipy.stats` module) to compute the correlation between wait and duration.  \n",
    "\n",
    "\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q1_6\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "r = ...\n",
    "r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the `pearsonr` (or `cor`) returns two values.  The second number is the *p*-value for a hypothesis test.  What hypothesis test?  This is an aside, but here's the details.  \n",
    "\n",
    "$H_o: \\rho = 0$  The correlation is 0.\n",
    "\n",
    "$H_a: \\rho \\not= 0$  The correlation is not 0.\n",
    "\n",
    "If the *p*-value is small (smaller than $\\alpha$), then we reject the null hypothesis, meaning we conclude that the correlation is NOT 0.  This means that there is *some* relationship between the two variables.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. The regression line\n",
    "\n",
    "Recall that the Root Mean Squared Error is \n",
    "\n",
    "$$RMSE = \\sqrt{MSE} = \\sqrt{\\frac{1}{n-2} SSE} = \\sqrt{\\frac{1}{n-2} \\sum_{i = 1}^n (y_i - \\hat{y}_i)^2}$$\n",
    "\n",
    "The linear regression line is the line that has the least possible value of the RMSE.\n",
    "\n",
    "Also, recall that the correlation is the **slope of the regression line when the data are put in standard units**.\n",
    "\n",
    "The next cell shows a scatterplot in standard units:\n",
    "\n",
    "$$\\text{waiting time in standard units} = r \\times \\text{eruption duration in standard units}.$$\n",
    "\n",
    "Use the sliders to adjust the line shown.  Find the slope/correlation by trying to find the smallest value of the RMSE, which is also displayed on the graph.  (Adjust the slope and intercept slowly, there is a lag between when you move the sliders and when the RMSE is updated.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y1 = faithful.column('wait')\n",
    "x1 = faithful.column('duration')\n",
    "\n",
    "xx = (x1-np.average(x1))/stats.tstd(x1)\n",
    "yy = (y1-np.average(y1))/stats.tstd(y1)\n",
    "\n",
    "import ipywidgets as widgets\n",
    "\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "import ipywidgets as widgets\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "def plot_func(correlation, intercept):\n",
    "    x = np.linspace(-3.5, 3.5)\n",
    "    y = x*correlation + intercept\n",
    "    plt.scatter(xx,yy, s=50, color=\"green\")\n",
    "    plt.ylim(-2.5,2.5)\n",
    "    plt.xlim(-2,2) \n",
    "    rmse = np.round((sum((yy - correlation*xx - intercept)**2)/(len(xx)-2))**0.5,2)\n",
    "    plt.plot(x, y)\n",
    "    plt.text(-1.5,1, f\"RMSE = {rmse}\", color=\"red\", size='large')\n",
    "    plt.ylabel(\"Duration\")\n",
    "    plt.xlabel(\"Wait\")\n",
    "\n",
    "interact(plot_func, correlation = widgets.FloatSlider(value=0.9001, min=0, max=1.0, step=0.01), \n",
    "\n",
    "         intercept=widgets.FloatSlider(value=0, min=-2, max=2, step=0.050));\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 2.1** What is the least value of the RMSE?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "least_rmse_standard_units = ...\n",
    "least_rmse_standard_units"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "How would you take a point in standard units and convert it back to original units?  We'd have to \"stretch\" its horizontal position by `duration_std` and its vertical position by `wait_std`. That means the same thing would happen to the slope of the line.\n",
    "\n",
    "Stretching a line horizontally makes it less steep, so we divide the slope by the stretching factor.  Stretching a line vertically makes it more steep, so we multiply the slope by the stretching factor.\n",
    "\n",
    "That is the slope of a regression line is: $\\hat{\\beta}_1 = r \\frac{S_y}{S_x}$\n",
    "\n",
    "**Question 2.2.** Calculate the slope of the regression line in original units, and assign it to `slope`.\n",
    "\n",
    "(If the \"stretching\" explanation is unintuitive, consult section [15.2](https://www.inferentialthinking.com/chapters/15/2/Regression_Line.html#the-equation-of-the-regression-line) in the textbook.)\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q2_1\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "slope = ...\n",
    "slope"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "We know that the regression line passes through the point `(duration_mean, wait_mean)`.  You might recall from high-school algebra that the equation for the line is therefore:\n",
    "\n",
    "$$\\text{waiting time} - \\verb|wait_mean| = \\texttt{slope} \\times (\\text{eruption duration} - \\verb|duration_mean|)$$\n",
    "\n",
    "The rearranged equation becomes:\n",
    "\n",
    "$$\\text{waiting time} = \\texttt{slope} \\times \\text{eruption duration} + (- \\texttt{slope} \\times \\verb|duration_mean| + \\verb|wait_mean|)$$\n",
    "\n",
    "\n",
    "**Question 2.2.** Calculate the intercept in original units and assign it to `intercept`.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q2_2\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "intercept = ...\n",
    "intercept"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## 3. Investigating the regression line\n",
    "The slope and intercept tell you exactly what the regression line looks like.  To predict the waiting time for an eruption, multiply the eruption's duration by `slope` and then add `intercept`.\n",
    "\n",
    "**Question 3.1.** Compute the predicted waiting time for an eruption that lasts 2 minutes, and for an eruption that lasts 5 minutes.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q3_1\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "two_minute_predicted_waiting_time = ...\n",
    "five_minute_predicted_waiting_time = ...\n",
    "\n",
    "# Here is a helper function to print out your predictions.\n",
    "# Don't modify the code below.\n",
    "def print_prediction(duration, predicted_waiting_time):\n",
    "    print(\"After an eruption lasting\", duration,\n",
    "          \"minutes, we predict you'll wait\", predicted_waiting_time,\n",
    "          \"minutes until the next eruption.\")\n",
    "\n",
    "print_prediction(2, two_minute_predicted_waiting_time)\n",
    "print_prediction(5, five_minute_predicted_waiting_time)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell plots the line that goes between those two points, which is (a segment of) the regression line."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "faithful.scatter( \"duration\", \"wait\") \n",
    "plots.plot([2, 5], [two_minute_predicted_waiting_time, five_minute_predicted_waiting_time]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 3.2.** Make predictions for the waiting time after each eruption in the `faithful` table.  (Of course, we know exactly what the waiting times were!  We are doing this so we can see how accurate our predictions are.)  Put these numbers into a column in a new table called `faithful_predictions`.  Its first row should look like this:\n",
    "\n",
    "|duration|wait|predicted wait|\n",
    "|-|-|-|\n",
    "|3.6|79|72.1011|\n",
    "\n",
    "*Hint:* Your answer can be just one line.  There is no need for a `for` loop; use array arithmetic instead.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q3_2\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful_predictions = \n",
    "\n",
    "faithful_predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 3.3.** How close were we?  Compute the *residual* for each eruption in the dataset.  The residual is the actual waiting time minus the predicted waiting time.  Add the residuals to `faithful_predictions` as a new column called `residual` and name the resulting table `faithful_residuals`.\n",
    "\n",
    "*Hint:* Again, your code will be much simpler if you don't use a `for` loop.\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q3_3\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "faithful_residuals_array = ...\n",
    "faithful_residuals = ...\n",
    "faithful_residuals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a plot of the residuals you computed.  Each point corresponds to one eruption.  It shows how much our prediction over- or under-estimated the waiting time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful_residuals.scatter(\"duration\", \"residual\", color=\"r\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There isn't really a pattern in the residuals, which confirms that it was reasonable to try linear regression.  It's true that there are two separate clouds; the eruption durations seemed to fall into two distinct clusters.  But that's just a pattern in the eruption durations, not a pattern in the relationship between eruption durations and waiting times.\n",
    "\n",
    "**Question 3.4** Use the interact plot below to set the slope and intercept to the values that results in the least possible value for the RMSE.  In the cell below that, record this least possible RMSE value.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "yy = faithful.column('wait')\n",
    "xx = faithful.column('duration')\n",
    "\n",
    "import ipywidgets as widgets\n",
    "\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "import ipywidgets as widgets\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "def plot_func(slope, intercept):\n",
    "    x = np.linspace(1, 6)\n",
    "    y = x*slope + intercept\n",
    "    plt.scatter(xx,yy, s=50, color=\"green\")\n",
    "    plt.ylim(40,100)\n",
    "    plt.xlim(1,6) \n",
    "    rmse = np.round((sum((yy - slope*xx - intercept)**2)/(len(xx)-2))**0.5,2)\n",
    "    plt.plot(x, y)\n",
    "    plt.text(1.5,90, f\"RMSE = {rmse}\", color=\"red\", size='large')\n",
    "    plt.xlabel(\"Duration\")\n",
    "    plt.ylabel(\"Wait\")\n",
    "\n",
    "interact(plot_func, slope = widgets.FloatSlider(value=10.73, min=5, max=15, step=0.1), \n",
    "\n",
    "         intercept=widgets.FloatSlider(value=33.47, min=20, max=40, step=0.1));\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "least_rmse_original_units = ...\n",
    "least_rmse_original_units"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. How accurate are different predictions?\n",
    "Earlier, you should have found that the correlation is fairly close to 1, so the line fits fairly well on the training data.  That means the residuals are overall small (close to 0) in comparison to the waiting times.\n",
    "\n",
    "We can see that visually by plotting the waiting times and residuals together:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful_predictions.scatter(\"duration\", \"wait\", label=\"actual waiting time\", color=\"blue\")\n",
    "plots.scatter(faithful_predictions.column(\"duration\"), faithful_residuals.column(\"residual\"), label=\"residual\", color=\"r\")\n",
    "plots.plot([2, 5], [two_minute_predicted_waiting_time, five_minute_predicted_waiting_time], label=\"regression line\")\n",
    "plots.legend(bbox_to_anchor=(1.7,.8));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, unless you have a strong reason to believe that the linear regression model is true, you should be wary of applying your prediction model to data that are very different from the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 4.1.** In `faithful`, no eruption lasted exactly 0, 2.5, or 60 minutes.  Using this line, what is the predicted waiting time for an eruption that lasts 0 minutes?  2.5 minutes?  An hour?\n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q4_1\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "zero_minute_predicted_waiting_time = ...\n",
    "two_point_five_minute_predicted_waiting_time = ...\n",
    "hour_predicted_waiting_time = ...\n",
    "\n",
    "print_prediction(0, zero_minute_predicted_waiting_time)\n",
    "print_prediction(2.5, two_point_five_minute_predicted_waiting_time)\n",
    "print_prediction(60, hour_predicted_waiting_time)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Question 2.** For each prediction, state whether you think it's reliable and explain your reasoning. \n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q4_2\n",
    "-->"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Write your answer here, replacing this text.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Divide and Conquer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears from the scatter diagram that there are two clusters of points: one for durations around 2 and another for durations between 3.5 and 5. A vertical line at 3 divides the two clusters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful.scatter(\"duration\", \"wait\", label=\"actual waiting time\", color=\"blue\")\n",
    "plots.plot([3, 3], [40, 100]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "**Question 1.** Separately compute the regression coefficients *r* for all the points with a duration below 3 **and then** for all the points with a duration above 3. To do so, first create two different tables of points, `below_3` and `above_3`, then compute the correlations.  \n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q5_1\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "below_3 = ...\n",
    "above_3 = ...\n",
    "below_3_r = ...\n",
    "above_3_r = ...\n",
    "\n",
    "print(\"For points below 3, r is\", below_3_r, \"; for points above 3, r is\", above_3_r)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false
   },
   "source": [
    "**Question 5.2.** Use the `linregress` function from `scipy.stats` to compute the slope and intercept for both the above_3 and below_3 datasets.  \n",
    "\n",
    "When you're done, record the slopes and intercepts in the cell below as  below_3_slope, below_3_intercept, above_3_slope and above_3_intercept.  \n",
    "\n",
    "<!--\n",
    "BEGIN QUESTION\n",
    "name: q5_2\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "below_3_slope = ...\n",
    "below_3_intercept = ...\n",
    "above_3_slope = ...\n",
    "above_3_intercept = ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The plot below shows two different regression lines, one for each cluster!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful.scatter(0, 1)\n",
    "plots.plot([1, 3], [below_3_slope*1+below_3_intercept, below_3_slope*3+below_3_intercept])\n",
    "plots.plot([3, 6], [above_3_slope*3+above_3_intercept, above_3_slope*6+above_3_intercept]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 3.**  In the cell below, we provide a function called `predict_wait` that takes a `duration` and returns the predicted wait time using the appropriate regression line, depending on whether the duration is below 3 or greater than (or equal to) 3.  Use the `.apply` method and this new function to add the predicted wait times to the `faithful` table.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "for_assignment_type": "student"
   },
   "outputs": [],
   "source": [
    "def predict_wait(duration):\n",
    "    if duration <=3:\n",
    "        return below_3_slope*duration + below_3_intercept\n",
    "    else:\n",
    "        return above_3_slope*duration + above_3_intercept\n",
    "        \n",
    "predict_wait(2)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The predicted wait times for each point appear below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faithful.with_column('predicted', faithful.apply(..., ...)).scatter(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 4.** Do you think the predictions produced by `predict_wait` would be more or less accurate than the predictions from the regression line you created in section 2? How could you tell?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Write your answer here, replacing this text.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question 5.** Use the sliders to set the slopes and intercepts to the values from the linregress output.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y1 = faithful.where('duration', are.below_or_equal_to(3)).column('wait')\n",
    "x1 = faithful.where('duration', are.below_or_equal_to(3)).column('duration')\n",
    "\n",
    "y2 = faithful.where('duration', are.above(3)).column('wait')\n",
    "x2 = faithful.where('duration', are.above(3)).column('duration')\n",
    "\n",
    "yy = faithful.column('wait')\n",
    "xx = faithful.column('duration')\n",
    "\n",
    "import ipywidgets as widgets\n",
    "\n",
    "from ipywidgets import interact, interactive, fixed, interact_manual\n",
    "import ipywidgets as widgets\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "def predict_wait_func(duration, lo_slope, hi_slope, lo_int, hi_int):\n",
    "    if(duration <=3):\n",
    "        return lo_slope * duration + lo_int\n",
    "    else:\n",
    "        return hi_slope * duration + hi_int\n",
    "\n",
    "def plot_func(lo_slope, hi_slope, lo_int, hi_int):\n",
    "    x3 = np.linspace(0, 3)\n",
    "    y3 = make_array()\n",
    "    x4 = np.linspace(3.01, 6)\n",
    "    y4 = make_array()\n",
    "    for i in np.arange(len(x3)):\n",
    "        y3 =np.append(y3, predict_wait_func(x3.item(i),lo_slope, hi_slope, lo_int, hi_int))\n",
    "    for i in np.arange(len(x4)):\n",
    "        y4 =np.append(y4, predict_wait_func(x4.item(i),lo_slope, hi_slope, lo_int, hi_int))\n",
    "    plt.scatter(x1,y1, s=50, color=\"green\")\n",
    "    plt.scatter(x2,y2, s=50, color=\"blue\")\n",
    "    plt.ylim(40, 100)\n",
    "    plt.xlim(1, 5.5)\n",
    "    new_residuals = make_array()\n",
    "    for i in np.arange(len(xx)):\n",
    "        new_residuals = np.append(new_residuals, yy.item(i) - predict_wait_func(xx.item(i),lo_slope, hi_slope, lo_int, hi_int) )\n",
    "    rmse = np.round((sum(new_residuals**2)/(len(xx)-2))**0.5,2)\n",
    "    plt.plot(x3, y3, color=\"green\")\n",
    "    plt.plot(x4, y4, color=\"blue\")\n",
    "    plt.text(1.5, 90, f\"RMSE = {rmse}\", color=\"red\", size='large')\n",
    "    plt.xlabel(\"Duration\")\n",
    "    plt.ylabel(\"Wait\")\n",
    "\n",
    "interact(plot_func, lo_slope = widgets.FloatSlider(value=6.35, min=5, max=15, step=0.1), \n",
    "        hi_slope = widgets.FloatSlider(value=5.44, min=0, max=15, step=0.1),\n",
    "        lo_int=widgets.FloatSlider(value=41.6, min=20, max=50, step=0.1),\n",
    "        hi_int=widgets.FloatSlider(value=56.6, min=20, max=70, step=0.1));\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "split_rmse = ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which RMSE is lower?  The one from the end of question 3.4 (least_rmse_original_units) or the one we just calculated (split_rmse)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summary #\n",
    "\n",
    "Run the cell below without changing anything in it.  It should produce a nice summary of what we've learned during this lab.  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The mean duration of an eruption is \", np.round(duration_mean,1), \"minutes with a standard deviation of\", \n",
    "      np.round(duration_std,1),\".\" )\n",
    "print(\"The mean wait time between eruptions is \", np.round(wait_mean,1), \"minutes with a standard deviation of\", \n",
    "      np.round(wait_std,1),\".\" )\n",
    "print(\"These two variables have a correlation of\", np.round(correlation,3), \".\")\n",
    "\n",
    "print(\"A single regression analysis has RMSE =\", least_rmse_original_units,\". \\nThat model is predicted wait = \", \n",
    "      np.round(slope, 2), \"duration + \", np.round(intercept,2), \".\")\n",
    "print(\"Splitting the data at durations of 3, we get an RMSE = \", split_rmse,\". \\nThis model uses two lines.  Before 3 we use \",\n",
    "      below_3_slope,\"duration + \", below_3_intercept, \"and after 3 it is \", above_3_slope, \"duration + \", above_3_intercept, \".\" )\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}