{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson: Let's preprocess the data!\n", "\n", "## File I/O and exploring the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "f = '../data/census_data.csv'\n", "\n", "# read in the file\n", "\n", "\n", "# look at the first 5 lines\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What are the column names?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sort the DataFrame by age and print out the last 5 lines\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting columns and rows" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create a subset of the data with the columns education, occupation, hours_per_week. Look at the first 5 rows.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create a subset of the data with rows 50-100 and the columns work_class and race. Look at the first 5 rows.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create a subset of the data where education_num is greater than 8 and where sex is equal to Female. 
Look at the first 5 rows\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## groupby" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Group by work_class and output the group names (hint: add .groups.keys() to the end of your groupby call).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's group by work_class and use the mean as the aggregate function\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivoting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's pivot on education_num and sex, with hours_per_week as the values and mean as the aggfunc\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a pivot table of your choosing, with any columns for rows and cols and a numerical column for values.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's do the following:\n", "\n", "For the machine learning section, can you extract a subset of the data where:\n", "\n", "- `native_country` equals `United-States`\n", "- `hours_per_week` is greater than 10\n", "- `age` is greater than 20 and less than 50\n", "- `education` is `Masters`\n", "\n", "It's going to be a bunch of boolean indexing!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Store that new dataframe in new_df and print out the first five rows.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Split the DataFrame so that all of the columns except the last one are in new_df_data, and the last one is in new_df_labels.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# For those using IPython Notebook/Wakari/NBViewer: Go to the [scikit_learn](scikit_learn.ipynb) notebook!\n", "\n", "(For those using code files, please go to scikit_learn_materials.py!)\n", "\n", "### End of Pandas section.\n", "\n", "# Lesson: let's classify our data!\n", "\n", "## Final preprocessing touches\n", "\n", "Scikit-learn estimators take in numeric data only, which means that we'll have to transform our categorical data into something the scikit-learn estimators can handle. This is actually much easier than it sounds! We're going to convert our dataframe into a dictionary, and then encode the categorical values in that dictionary as columns of 1s and 0s.\n", "\n", "Let's first transform the `DataFrame` into a dictionary. We first have to transpose our `DataFrame`, so that `to_dict()` gives us one nested dictionary per row. Then we'll put each nested dictionary into a list, because scikit-learn's `DictVectorizer` object takes a list of dictionaries to encode. We only need the nested dictionaries themselves, not the row-index keys they are stored under, and we keep them in row order so they stay aligned with the labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "new_df_transpose = new_df_data.transpose()\n", "\n", "data_into_dict = new_df_transpose.to_dict()\n", "# keep row order explicit so the instances stay aligned with new_df_labels\n", "census_data = [data_into_dict[k] for k in new_df_transpose.columns]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's encode those features and instances." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer\n", "\n", "dv = DictVectorizer()\n", "transformed_data = dv.fit_transform(census_data).toarray()\n", "transformed_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've done that, let's encode the labels." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "le = LabelEncoder()\n", "transformed_labels = le.fit_transform(new_df_labels)\n", "transformed_labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've done that, can you separate the `transformed_data` and `transformed_labels` into training and test sets?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fit all three classifiers to this data and predict with each." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "\n", "# fit all\n", "\n", "\n", "# predict all\n", "\n", "\n", "# Output kNN predictions on first 10 labels\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Output decision tree predictions on first 10 labels\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Output NB predictions on first 10 labels\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's cross-validate the classifiers." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Run cross-validation on kNN. Set cv=5\n", "from sklearn.model_selection import cross_val_score\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Run cross-validation on NB. Set cv=5\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# For those using IPython Notebook/Wakari/NBViewer: Go to the [matplotlib](matplotlib.ipynb) notebook!\n", "\n", "(For those using code files, please go to matplotlib_materials.py!)\n", "\n", "### End of scikit-learn section.\n", "\n", "# Lesson: what can plots tell us about our data and results?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Let's look at the distribution of classes in our dataset. Can you get the value_counts of new_df_labels and make a bar chart?\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Now, can you make a bar chart comparing the predicted kNN labels to the actual labels? You can use the first 10 only if you want.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# What other visualizations would be helpful? Can you come up with a few more?\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 0 }