{
 "metadata": {
  "name": "",
  "signature": "sha256:06146f69c85129954d4afb17dc94627f42d8c42eb17f0989539e52c4b900ff2a"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "CS194-16 Introduction to Data Science\n",
      "\n",
      "**Downloading** To download this notebook from a browser in your VM, click on the download link at the top right corner of this page. \n",
      "\n",
       "**NOTE** Click near here to select this cell; press Enter to go into cell edit mode, and Shift-Enter to run the cell and leave edit mode.\n",
      "\n",
      "**Name**: *Please put your name*\n",
      "\n",
      "**Student ID**: *Please put your student ID*\n",
      "\n",
      "\n",
      "Homework 2: Exploratory Data Analysis\n",
      "===\n",
      "\n",
      "## Overview\n",
      "\n",
      "Exploratory Data Analysis (EDA) is the process of examining and visualizing a novel dataset to understand its characteristics and patterns, before attempting more formal analysis. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The Dataset\n",
      "\n",
       "The dataset we'll use can be found at:\n",
      "\n",
      "https://archive.ics.uci.edu/ml/datasets/Abalone\n",
      "\n",
       "It's a dataset containing various attributes of abalone specimens, in particular the number of \"rings\" (last column), which indicates the approximate age of a specimen. The dataset is typically used to predict the number of rings from the other attributes.\n",
      "\n",
      "The data directory contains these files:\n",
      "\n",
       "* **abalone.data**, a CSV file with data on a number of abalone specimens.\n",
       "* **abalone.names**, a text file with background information on the dataset.\n",
      "\n",
      "Create a HW2 directory on your VM and download this data into it. "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Deliverables\n",
      "\n",
       "Complete all the exercises below and turn in a write-up in the form of an IPython notebook, that is, **an .ipynb file**.\n",
       "The write-up should include your code, answers to exercise questions, and plots of results.\n",
       "Submit it as an assignment on bcourses, with this file (after your edits) as an attachment. \n",
      "\n",
       "You can use this notebook and fill in answers inline, or, if you prefer, do your write-up in a separate notebook.\n",
      "Don't forget to include answers to questions that ask for natural language responses, i.e., in English, not code!\n",
      "\n",
      "We would prefer to test some of your code automatically, so please try to submit a notebook that uses the function names requested by the questions and that can be executed with \"Cell > Run all\"."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Guidelines\n",
      "\n",
      "#### Code\n",
      "\n",
      "This assignment can be done with basic python and matplotlib.\n",
       "Feel free to use pandas, too, which you may find well suited to several of the exercises.\n",
      "As for other libraries, please check with course staff whether they're allowed.\n",
      "In general, we want you to use whatever is comfortable, except for libraries that include functionality covered in the assignment.\n",
      "\n",
      "You're not required to do your coding in IPython, so feel free to use your favorite editor or IDE.\n",
      "But when you're done, remember to put your code into a notebook for your write up.\n",
      "\n",
      "#### Collaboration\n",
      "\n",
       "This assignment is to be done individually.  Everyone should be getting hands-on experience in this course.  You are free to discuss course material with fellow students, and we encourage you to use Internet resources to aid your understanding, but the work you turn in, including all code and answers, must be your own work."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Part 0: Reading\n",
      "\n",
      "### Exercise 0\n",
      "\n",
       "Step 0 is to read the dataset. First download it from the link above and save it into a data directory such as the path in the cell below. Look at the first few lines of the file. Notice that most columns are numeric, but the first column is a string taking one of three values (the sex of the specimen). \n",
      "\n",
       "Now construct two versions of the data table. First produce a variable 'abalone_raw', which is a list of records, where each record is a list of strings. Then construct the variable 'abalone', a list of lists of numbers, by parsing the numeric strings into float values. For the first column, map the string values to numeric ones, and create a dictionary and an inverse dictionary to map between the string and numeric values. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Please preserve the format of this line so we can use it for automated testing.\n",
       "DATA_PATH = \"/home/datascience/HWs/HW2/data/\" # Make this the /path/to/the/data\n",
      "\n",
      "import csv\n",
      "# TODO Load data files here...\n",
       "def loaddatafile(fname):\n",
       "    pass # TODO: read fname into a list of records, each a list of strings\n",
       "\n",
       "def rawtodata(table): # convert the string table to a numeric one, and return dicts\n",
       "    pass # TODO: build the numeric table and the two dicts\n",
      "\n",
      "abalone_raw = loaddatafile(DATA_PATH + \"abalone.data\")\n",
      "abalone,adict,alkup = rawtodata(abalone_raw)\n",
      "adict                 # check the string -> number map for the first column"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
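     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For illustration, here is a minimal sketch of the csv-reading and string-to-number mechanics on a throwaway two-row file (the rows below are made up for the sketch, not taken from the abalone data):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Minimal sketch: read a small csv file into a list of string records,\n",
       "# then build a string -> number code dict and its inverse for column 0.\n",
       "import csv, os, tempfile\n",
       "\n",
       "tmp = tempfile.NamedTemporaryFile(\"w\", suffix=\".csv\", delete=False)\n",
       "tmp.write(\"M,0.455,0.365\\nF,0.530,0.420\\n\")\n",
       "tmp.close()\n",
       "\n",
       "with open(tmp.name) as f:\n",
       "    rows = [row for row in csv.reader(f)]  # each record is a list of strings\n",
       "os.unlink(tmp.name)\n",
       "\n",
       "codes = {}\n",
       "for r in rows:\n",
       "    codes.setdefault(r[0], float(len(codes)))  # first-seen order: M -> 0.0, F -> 1.0\n",
       "inv = {v: k for k, v in codes.items()}\n",
       "rows, codes, inv"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },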
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 1: Basic Statistics"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "Create a list of the column names for this dataset from the dataset description (the abalone.names file). Preserve the case and the spaces in these names:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
       "colnames = []"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now create a dictionary 'coldict' mapping column name to column, and use this to define a \"getcol\" function which returns a named column from the abalone table."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
       "coldict = {} # TODO: map column name -> column\n",
       "\n",
       "def getcol(colname):\n",
       "    pass # TODO: return the column of 'abalone' named colname"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
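     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "As a sketch of this pattern on a toy table (toy numbers, not the abalone data), a name -> column dictionary can be built by transposing the rows:"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Toy example of the column-dictionary pattern (not the abalone data).\n",
       "toy_table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three rows, two columns\n",
       "toy_names = [\"a\", \"b\"]\n",
       "toy_cols = [list(c) for c in zip(*toy_table)]      # transpose rows -> columns\n",
       "toy_coldict = dict(zip(toy_names, toy_cols))\n",
       "toy_coldict[\"a\"]                                   # -> [1.0, 3.0, 5.0]"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },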
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: What is the min, max, average and std deviation of the Height column?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
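     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For reference, the four statistics worked out on a toy list (this uses the population standard deviation, which is also NumPy's default):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# The four summary statistics on toy numbers (not an abalone column).\n",
       "import math\n",
       "\n",
       "xs = [1.0, 2.0, 3.0, 4.0]\n",
       "mn, mx = min(xs), max(xs)\n",
       "mean = sum(xs) / len(xs)\n",
       "std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))  # population std\n",
       "mn, mx, mean, std  # -> (1.0, 4.0, 2.5, ~1.118)"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },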
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Now create a 9 x 5 table. Each row of the table should be a column name followed by the values of min, max, mean and std for that column."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "summaries = []\n",
      "    \n",
      "summaries"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: List anything interesting about these values (this is an open-ended question)"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 2: Histograms"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "> TODO: Now create a 3x3 grid of histograms, one for each column. Make sure your figure is large enough (it should span most of the width of the page). We recommend you use pylab and its 'subplots' function. Include the column name as a title above each subfigure. Try to use loops rather than enumerating all 9 column names."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pylab\n",
      "%matplotlib inline"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
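     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "A sketch of the 3x3 grid mechanics on synthetic random data (the titles below are placeholders, not the real column names):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# Layout sketch: a 3x3 grid of histograms over synthetic data.\n",
       "import pylab, random\n",
       "\n",
       "fig, axes = pylab.subplots(3, 3, figsize=(12, 10))\n",
       "for i, ax in enumerate(axes.flat):\n",
       "    ax.hist([random.gauss(0.0, 1.0) for _ in range(200)], bins=20)\n",
       "    ax.set_title(\"column %d\" % i)  # put the real column name here\n",
       "fig.tight_layout()"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },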
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Which of the column data are skewed and in which direction? "
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 3: Scatter plots"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "> TODO: Now create a grid of scatter plots, one for each column vs the \"Rings\" column. Use color to distinguish the sex of the specimen in each plot. Give the plots titles of the form \"<colname> vs Rings\". It's fine to include \"Rings vs Rings\" as the last plot."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "rings = getcol(\"Rings\")\n",
      "sex = getcol(\"Sex\")\n",
      "# TODO create the 3x3 grid of scatter plots"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: Do you notice any issues with the dataset? e.g. outliers?"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 4: Regression lines"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Add regression lines to the scatter plots above as per lab 2."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO code to generate scatter plots with regression lines"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 5: Prediction Error"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "> TODO: Next we would like to explore prediction and find the feature that gives the best (lowest-error) predictions of the number of rings. You can do this with polyfit, once again predicting the Rings feature from one of the others, by adding an option (full=True) that makes it also return the \"residual\" of the fit, which is a measure of its prediction error. Read the documentation for polyfit on how to do this. Then make a 3 x 3 array of residuals. "
     ]
    },
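     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "To see what the residual option returns, here is polyfit on toy data (four made-up points, worked out by hand):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# With full=True, polyfit also returns the residual: the sum of squared\n",
       "# errors of the fit. For these toy points the best line is y = 1.3x - 0.2\n",
       "# and the residual works out to 0.3.\n",
       "import numpy as np\n",
       "\n",
       "tx = [0.0, 1.0, 2.0, 3.0]\n",
       "ty = [0.0, 1.0, 2.0, 4.0]\n",
       "coeffs, res, rank, sv, rcond = np.polyfit(tx, ty, 1, full=True)\n",
       "coeffs, res[0]  # -> slope ~1.3, intercept ~-0.2, residual ~0.3"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },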
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
       "import numpy as np\n",
       "\n",
       "residuals = np.zeros([3,3])\n",
      "# TODO get the residuals returned by polyfit\n",
      "\n",
      "residuals"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> TODO: What feature gives the smallest residual (other than Rings of course)?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "Each residual is the sum of the squared errors over all the predictions. A more useful measure is the RMS (root-mean-square) error, which estimates how far the actual ring count of a typical specimen is from its prediction. From the residuals above, compute the RMS value for each fit. "
     ]
    },
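     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "For example, a residual (sum of squared errors) of 0.3 over n = 4 points gives an RMS error of sqrt(0.3 / 4), about 0.27 rings (toy numbers, not the real fit):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# RMS error from a residual: sqrt(SSE / n). Toy numbers, not the real fit.\n",
       "import math\n",
       "\n",
       "sse = 0.3  # a residual: sum of squared errors over all points\n",
       "n = 4      # number of points in the fit\n",
       "rms = math.sqrt(sse / n)\n",
       "rms        # -> ~0.2739"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },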
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
       "rms_residuals = None # TODO\n",
      "rms_residuals"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Part 6: Significance"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "So far we have studied prediction without worrying about chance. The linear regression coefficient between any two data sequences of the same size will normally be non-zero simply due to noise, which suggests that one sequence \"predicts\" the other. For example, pick a random woman and a random man from a room: their ages will almost surely differ, so on this two-person sample age and gender \"predict\" each other perfectly, but the direction of influence is completely arbitrary. Obviously this doesn't generalize.\n",
       "\n",
       "Statistical tests measure how likely an observation would be due to chance if there were no \"real\" influence between the two variables. This probability is called a p-value. You want it to be small, say less than 0.01. \n",
       "\n",
       "> TODO: Use the 'linregress' function from scipy.stats to perform linear fits between each data column and the Rings column. Save the p-values it returns for each fit into a 3 x 3 array. "
     ]
    },
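     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "As a quick sketch of the linregress interface on toy data (assuming scipy is available):"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "# linregress returns slope, intercept, r-value, p-value and stderr.\n",
       "from scipy.stats import linregress\n",
       "\n",
       "slope, intercept, r, p, stderr = linregress([0.0, 1.0, 2.0, 3.0],\n",
       "                                            [0.0, 2.0, 4.0, 6.0])\n",
       "slope, p  # a perfectly linear toy relation: slope 2, p-value ~0"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },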
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
       "import numpy as np\n",
       "from scipy.stats import linregress\n",
       "\n",
       "pvalues = np.zeros([3,3])\n",
      "# TODO: fill in the pvalues array\n",
      "\n",
      "pvalues"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "> TODO: Are all the p-values less than 0.01?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
       "### Submission\n",
      "\n",
      "Please use this link to submit this notebook on bcourses:\n",
      "https://bcourses.berkeley.edu/courses/1377158/assignments/6675873"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}