{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Adult census data from 1996\n", "======\n", "\n", "In this notebook we will have a look at the [census income data from 1996](http://archive.ics.uci.edu/ml/datasets/Census+Income). The dataset is part of the [UC Irvine ML repository](http://archive.ics.uci.edu/ml/). Specifically, the data can be found at http://archive.ics.uci.edu/ml/machine-learning-databases/adult/." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python imports\n", "------" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline\n", "matplotlib.style.use('ggplot')\n", "matplotlib.rc_params_from_file(\"../styles/matplotlibrc\" ).update()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data cleaning, importing and munging\n", "------\n", "\n", "The data is a set of CSV files that have already been split into a training set and a test set. 
To make the files easy to read into a pandas DataFrame, we preprocess them by adding a header taken from the dataset's metadata and an index column." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import csv\n", "\n", "def indexAndAnnotateDataSet(filename):\n", "    # Write a copy of the file with a header row and a leading index column.\n", "    outname = filename.replace('.csv', '_clean.csv')\n", "    # Column names from the dataset's metadata; the leading empty name labels the index column.\n", "    columns = ['', 'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'target']\n", "    # newline='' lets the csv module control the line endings.\n", "    with open(filename, 'r') as f, open(outname, 'w', newline='') as csvfile:\n", "        filewriter = csv.writer(csvfile, delimiter=',')\n", "        filewriter.writerow(columns)\n", "        idx = 0\n", "        for line in f:\n", "            # Strip whitespace and the trailing '.' that the test set appends to its labels.\n", "            lst = [item.strip().strip('.') for item in line.strip().split(',')]\n", "            # Skip blank lines and any line that is not a complete record.\n", "            if len(lst) != len(columns) - 1:\n", "                continue\n", "            filewriter.writerow([str(idx)] + lst)\n", "            idx += 1\n", "    return outname" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since the datasets are fairly small, we download them here:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import urllib.request\n", "\n", "\n", "url_train_dataset = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'\n", "urllib.request.urlretrieve(url_train_dataset, 'adult.csv')\n", "\n", "url_test_dataset = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'\n", "urllib.request.urlretrieve(url_test_dataset, 'adult_test.csv')\n", "\n", "clean_DS_1 = indexAndAnnotateDataSet(os.path.join(os.getcwd(), 'adult.csv'))\n", "clean_DS_2 = indexAndAnnotateDataSet(os.path.join(os.getcwd(), 'adult_test.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use scikit-learn's built-in cross-validation methods. 
Hence, we join the training and test data into one larger data frame to increase the size of the dataset. Moreover, we drop rows that contain `na` values as well as the column `fnlwgt`, as this is a control value introduced by the original authors of the dataset.\n", "\n", "Reading in the CSV files and appending the data frames, we get:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n",
"| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |\n",
"| 1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |\n",
"| 2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |\n",
"| 5 | 37 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 6 | 49 | Private | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |\n",
"| 7 | 52 | Self-emp-not-inc | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |\n",
"| 8 | 31 | Private | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |\n",
"| 9 | 42 | Private | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |\n",
"| 10 | 37 | Private | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | Black | Male | 0 | 0 | 80 | United-States | >50K |\n",
"| 11 | 30 | State-gov | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | India | >50K |\n",
"| 12 | 23 | Private | Bachelors | 13 | Never-married | Adm-clerical | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |\n",
"| 13 | 32 | Private | Assoc-acdm | 12 | Never-married | Sales | Not-in-family | Black | Male | 0 | 0 | 50 | United-States | <=50K |\n",
"| 14 | 40 | Private | Assoc-voc | 11 | Married-civ-spouse | Craft-repair | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | ? | >50K |\n",
"| 15 | 34 | Private | 7th-8th | 4 | Married-civ-spouse | Transport-moving | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 45 | Mexico | <=50K |\n",
"| 16 | 25 | Self-emp-not-inc | HS-grad | 9 | Never-married | Farming-fishing | Own-child | White | Male | 0 | 0 | 35 | United-States | <=50K |\n",
"| 17 | 32 | Private | HS-grad | 9 | Never-married | Machine-op-inspct | Unmarried | White | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 18 | 38 | Private | 11th | 7 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |\n",
"| 19 | 43 | Self-emp-not-inc | Masters | 14 | Divorced | Exec-managerial | Unmarried | White | Female | 0 | 0 | 45 | United-States | >50K |\n",
"| 20 | 40 | Private | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |\n",
"| 21 | 54 | Private | HS-grad | 9 | Separated | Other-service | Unmarried | Black | Female | 0 | 0 | 20 | United-States | <=50K |\n",
"| 22 | 35 | Federal-gov | 9th | 5 | Married-civ-spouse | Farming-fishing | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 23 | 43 | Private | 11th | 7 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 2042 | 40 | United-States | <=50K |\n",
"| 24 | 59 | Private | HS-grad | 9 | Divorced | Tech-support | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 25 | 56 | Local-gov | Bachelors | 13 | Married-civ-spouse | Tech-support | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |\n",
"| 26 | 19 | Private | HS-grad | 9 | Never-married | Craft-repair | Own-child | White | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 27 | 54 | ? | Some-college | 10 | Married-civ-spouse | ? | Husband | Asian-Pac-Islander | Male | 0 | 0 | 60 | South | >50K |\n",
"| 28 | 39 | Private | HS-grad | 9 | Divorced | Exec-managerial | Not-in-family | White | Male | 0 | 0 | 80 | United-States | <=50K |\n",
"| 29 | 49 | Private | HS-grad | 9 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 32531 | 30 | ? | Bachelors | 13 | Never-married | ? | Not-in-family | Asian-Pac-Islander | Female | 0 | 0 | 99 | United-States | <=50K |\n",
"| 32532 | 34 | Private | Doctorate | 16 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 60 | United-States | >50K |\n",
"| 32533 | 54 | Private | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | Male | 0 | 0 | 50 | Japan | >50K |\n",
"| 32534 | 37 | Private | Some-college | 10 | Divorced | Adm-clerical | Unmarried | White | Female | 0 | 0 | 39 | United-States | <=50K |\n",
"| 32535 | 22 | Private | 12th | 8 | Never-married | Protective-serv | Own-child | Black | Male | 0 | 0 | 35 | United-States | <=50K |\n",
"| 32536 | 34 | Private | Bachelors | 13 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 55 | United-States | >50K |\n",
"| 32537 | 30 | Private | HS-grad | 9 | Never-married | Craft-repair | Not-in-family | Black | Male | 0 | 0 | 46 | United-States | <=50K |\n",
"| 32538 | 38 | Private | Bachelors | 13 | Divorced | Prof-specialty | Unmarried | Black | Female | 15020 | 0 | 45 | United-States | >50K |\n",
"| 32539 | 71 | ? | Doctorate | 16 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 10 | United-States | >50K |\n",
"| 32540 | 45 | State-gov | HS-grad | 9 | Separated | Adm-clerical | Own-child | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32541 | 41 | ? | HS-grad | 9 | Separated | ? | Not-in-family | Black | Female | 0 | 0 | 32 | United-States | <=50K |\n",
"| 32542 | 72 | ? | HS-grad | 9 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 25 | United-States | <=50K |\n",
"| 32543 | 45 | Local-gov | Assoc-acdm | 12 | Divorced | Prof-specialty | Unmarried | White | Female | 0 | 0 | 48 | United-States | <=50K |\n",
"| 32544 | 31 | Private | Masters | 14 | Divorced | Other-service | Not-in-family | Other | Female | 0 | 0 | 30 | United-States | <=50K |\n",
"| 32545 | 39 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 20 | United-States | >50K |\n",
"| 32546 | 37 | Private | Assoc-acdm | 12 | Divorced | Tech-support | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32547 | 43 | Private | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | Mexico | <=50K |\n",
"| 32548 | 65 | Self-emp-not-inc | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 1086 | 0 | 60 | United-States | <=50K |\n",
"| 32549 | 43 | State-gov | Some-college | 10 | Divorced | Adm-clerical | Other-relative | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32550 | 43 | Self-emp-not-inc | Some-college | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |\n",
"| 32551 | 32 | Private | 10th | 6 | Married-civ-spouse | Handlers-cleaners | Husband | Amer-Indian-Eskimo | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32552 | 43 | Private | Assoc-voc | 11 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 45 | United-States | <=50K |\n",
"| 32553 | 32 | Private | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan | <=50K |\n",
"| 32554 | 53 | Private | Masters | 14 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |\n",
"| 32555 | 22 | Private | Some-college | 10 | Never-married | Protective-serv | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32556 | 27 | Private | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |\n",
"| 32557 | 40 | Private | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |\n",
"| 32558 | 58 | Private | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |\n",
"| 32559 | 22 | Private | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |\n",
"| 32560 | 52 | Self-emp-inc | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States | >50K |\n",
"32561 rows × 14 columns
\n",
"\n",
"| | education | education-num |\n",
"|---|---|---|\n",
"| 224 | Preschool | 1 |\n",
"| 160 | 1st-4th | 2 |\n",
"| 56 | 5th-6th | 3 |\n",
"| 15 | 7th-8th | 4 |\n",
"| 6 | 9th | 5 |\n",
"| 77 | 10th | 6 |\n",
"| 3 | 11th | 7 |\n",
"| 415 | 12th | 8 |\n",
"| 2 | HS-grad | 9 |\n",
"| 10 | Some-college | 10 |\n",
"| 14 | Assoc-voc | 11 |\n",
"| 13 | Assoc-acdm | 12 |\n",
"| 0 | Bachelors | 13 |\n",
"| 5 | Masters | 14 |\n",
"| 52 | Prof-school | 15 |\n",
"| 20 | Doctorate | 16 |\n",
"