{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lesson: Let's preprocess the data!\n",
    "\n",
    "## File I/O and exploring the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "f = '../data/census_data.csv'\n",
    "\n",
    "# read in the file\n",
    "\n",
    "\n",
    "# look at the first 5 lines\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# What are the column names?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Sort the DataFrame by age and print out the last 5 lines\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting columns and rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# create a subset of the data with the columns education, occupation, hours_per_week. Look at the first 5 rows.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# create a subset of the data with rows 50-100 and the columns work_class and race. Look at the first 5 rows.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# create a subset of the data where education_num is greater than 8 and where sex is equal to Female. Look at the first 5 rows\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## groupby"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Group by work_class and output the group names (hint: add .keys() to the end of your line of code).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's group by work_class and use the mean as the aggregate function\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pivoting"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's pivot on education_num and sex, with hours_per_week as the values and mean as the aggfunc\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Create a pivot table of your choosing, with any columns for rows and cols and a numerical columns for values.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's do the following:\n",
    "\n",
    "For the machine learning section, can you extract a subset of the data where:\n",
    "\n",
    "- `native_country` equals `United-States`\n",
    "- `hours_per_week` is greater than 10\n",
    "- `age` is greater than 20 and less than 50\n",
    "- `education` is `Masters`\n",
    "\n",
    "It's going to be a bunch of boolean indexing!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Store that new dataframe in new_df and print out the first five rows.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Split the DataFrame so that all of the columns except the last one are in new_df_data, and the last one is in new_df_labels.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# For those using IPython Notebook/Wakari/NBViewer: Go to the [scikit_learn](scikit_learn.ipynb) notebook!\n",
    "\n",
    "(For those using code files, please go to scikit_learn_materials.py!)\n",
    "\n",
    "###End of Pandas section.\n",
    "\n",
    "# Lesson: let's classify our data!\n",
    "\n",
    "## Final preprocessing touches\n",
    "\n",
    "Scikit-learn estimators take in continuous data only, which means that we'll have to transform our categorical data into something the scikit-learn estimators can handle. This is actually much easier than it sounds! We're going to convert our dataframe into a dictionary, and then encode the data in that dictionary into arrays of 1s and 0s.\n",
    "\n",
    "Let's first transform the `DataFrame` into a dictionary. We first have to transpose our `DataFrame`, so there is one row per nested dictionary. Finally, we'll put each item into a list, because scikit-learn's `DictVectorizer` object takes a list of dictionaries to encode. We also only need the values from our list of dictionaries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "new_df_transpose = new_df_data.transpose()\n",
    "\n",
    "data_into_dict = new_df_transpose.to_dict()\n",
    "census_data = [v for k, v in data_into_dict.iteritems()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's encode those features and instances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "\n",
    "dv = DictVectorizer()\n",
    "transformed_data = dv.fit_transform(census_data).toarray()\n",
    "transformed_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've done that, let's encode the labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "le = LabelEncoder()\n",
    "transformed_labels = le.fit_transform(new_df_labels)\n",
    "transformed_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've done that, can you separate the `transformed_data` and `transformed_labels` into training and test sets?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.cross_validation import train_test_split\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's fit and predict all three classifiers to this data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "# fit all\n",
    "\n",
    "\n",
    "# predict all\n",
    "\n",
    "\n",
    "# Output kNN predictions on first 10 labels\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Output decision tree predictions on first 10 labels\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Output NB predictions on first 10 labels\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's cross validate each."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run cross-validation on kNN. Set cv=5\n",
    "from sklearn.cross_validation import cross_val_score\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run cross-validation on NB. Set cv=5\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# For those using IPython Notebook/Wakari/NBViewer: Go to the [matplotlib](matplotlib.ipynb) notebook!\n",
    "\n",
    "(For those using code files, please go to matplotlib_materials.py!)\n",
    "\n",
    "### End of scikit-learn section.\n",
    "\n",
    "# Lesson: what can plots tell us about our data and results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Let's look at the distribution of classes in our dataset. Can you get the value_counts of new_df_labels and make a bar chart?\n",
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Now, can you make a bar chart comparing the predicted kNN labels to the actual labels? You can use the first 10 only if you want.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# What other visualizations would be helpful? Can you come up with a few more?\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}